Team Members : Dario Prawara Teh Wei Rong (2201858) | Lim Zhen Yang (2214506)
Reinforcement Learning (RL) is a type of machine learning that focuses on training agents to make decisions in an environment by maximizing a reward signal. The roots of RL stretch all the way back to the 1930s and 40s, when Skinner presented his experimental research on animal behaviour. He described the concept of "operant conditioning", which involved manipulating the consequences of an animal's behaviour in order to change the likelihood that the behaviour would occur in the future. (Skinner, 1991)
For example, one of his most famous experiments was the "Skinner Box". Skinner placed a rat in a box with a lever and a food dispenser, and demonstrated how the rat learned to press the lever to receive a food reward. This experiment helped him develop his theory of operant conditioning, which states that behavior is shaped by the consequences (rewards and punishments) that follow it.
Any goal can be formalized as the outcome of maximizing a cumulative reward - Hado van Hasselt, DeepMind.com
Each of the algorithms revolves around an agent that acts in an environment. An agent can contain a few different types of components. These are:

(Image credits: deepmind.com)
When an agent is initialized in a new environment, its actions are essentially random: the agent possesses no knowledge of what to do, or even what the task is. Only by interacting with the environment, gaining knowledge from data, and learning which actions are optimal does it improve. However, this reliance on data leads to a trade-off between two behaviours. (Wang, Zariphopoulou and Zhou, 2019)
Exploitation: The agent learns that a certain action returns some reward. Because the goal is to maximize the total reward, the agent then continues to exploit this specific knowledge by repeating the move. As one can imagine, if the agent has not visited a large enough portion of the action space, this knowledge may lead to a suboptimal policy (Wiering, 1999).
Exploration: The agent takes actions that do not currently have the maximum expected reward, in order to learn more about the environment and discover better options for the future. However, an agent that focuses solely on acquiring new knowledge wastes resources, time and opportunities.
Thus, the agent must learn to balance the trade-off between exploring and exploiting, to learn the actions that ultimately lead to the optimal policy.
What are some approaches to tackle this issue? The simplest is to choose randomly: on every move there is a 50% chance to explore and a 50% chance to exploit. A smarter move is to introduce a parameter epsilon ϵ that controls the probability of exploring, with the probability of exploiting being 1 - ϵ. ϵ can then be tuned, which empirically performs much better. (Bather, 1990)
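As a minimal sketch of this idea (the function name is illustrative; here ϵ is taken as the exploration probability, the usual ϵ-greedy convention, over a generic list of estimated action values):

```python
import random

def epsilon_greedy(q_values, epsilon):
    """Return a random action index with probability epsilon (explore),
    otherwise the index of the highest estimated value (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))  # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit
```

With ϵ = 0 the rule is purely greedy; with ϵ = 1 it is purely random, and tuning ϵ between those extremes trades off the two behaviours.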
Unlike in Supervised Learning, agents usually do not get immediate feedback on a per-action basis. Rather, reward is attributed to a whole sequence of actions. This means agents must account for the possibility that greedy behaviour (grabbing immediate rewards) may result in less future reward.
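To make "future reward" concrete, RL typically scores a reward sequence by its discounted return, the sum of rewards weighted by a discount factor γ. A small illustration (the rewards and γ = 0.9 are arbitrary choices for the example): a greedy action earning +1 now loses to a patient one earning +10 two steps later.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards weighted by gamma**t for t = 0, 1, 2, ..."""
    return sum(r * gamma**t for t, r in enumerate(rewards))

greedy = discounted_return([1, 0, 0])    # grab +1 now -> 1.0
patient = discounted_return([0, 0, 10])  # wait for +10 -> 10 * 0.9**2, about 8.1
```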
RL can be used to optimize decision making in systems where the decision maker does not have complete information about the system or the consequences of its actions. It can also control systems that are difficult to model completely with mathematical equations, such as robots that must operate in uncertain environments, as well as games and other autonomous systems.
For example, Boston Dynamics has used reinforcement learning to train its robots to balance and walk on rough terrain, such as rocks or uneven surfaces. The robots receive rewards for maintaining balance and penalties for falling over, allowing them to learn to walk more stably and efficiently over time.

Boston Dynamics Robot (Image Credits: bostondynamics.com)
RL has proven to be a powerful tool for Boston Dynamics in their development of advanced robots, allowing them to perform complex and dynamic tasks in real-world environments with greater stability and robustness. (Pineda-Villavicencio, Ugon and Yost, 2018)
Before we begin, let us take a look at our project's objective.
Using OpenAI Gym, apply a suitable modification of the deep Q-network (DQN) architecture to the problem. The model must exert an appropriate torque on the pendulum to balance it.
Pendulum is one of the five classic control environments. These environments are stochastic in terms of their initial state, within a given range.

The inverted pendulum swingup problem is based on the classic problem in control theory. The system consists of a pendulum attached at one end to a fixed point, and the other end being free. The pendulum starts in a random position and the goal is to apply torque on the free end to swing it into an upright position, with its center of gravity right above the fixed point.
Action Space - The pendulum can perform only one action (torque): an ndarray with shape (1,) representing the torque applied to the free end of the pendulum, in the range -2.0 to 2.0.
Observation Space - There are a total of 3 distinct components in the observation space: the x and y coordinates of the pendulum's free end (cos(theta) and sin(theta)) and its angular velocity.
Rewards Granted - For each time step, the reward is r = -(theta^2 + 0.1 * theta_dot^2 + 0.001 * torque^2), where theta is the pendulum's angle normalized to [-pi, pi].
The worst possible per-step reward is therefore -16.2736044 (pendulum hanging straight down at maximum angular velocity and maximum torque), while the best is 0, representing the pendulum perfectly upright and balanced with no torque applied.
The pendulum starts at a random angle in [-pi, pi] and a random angular velocity in [-1, 1] and the episode truncates at 200 time steps.
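As a sanity check, the per-step reward used by Pendulum-v1, r = -(theta^2 + 0.1 * theta_dot^2 + 0.001 * torque^2), reproduces the -16.2736044 bound at the worst-case state (a minimal sketch; the environment computes this internally):

```python
import math

def pendulum_reward(theta, theta_dot, torque):
    """Per-step reward for Pendulum-v1: penalizes the angle from upright,
    the angular velocity, and the applied torque (theta in [-pi, pi])."""
    return -(theta**2 + 0.1 * theta_dot**2 + 0.001 * torque**2)

# Best case: perfectly upright, motionless, no torque applied
best = pendulum_reward(0.0, 0.0, 0.0)       # 0.0
# Worst case: hanging down (pi), max speed (8), max torque (2)
worst = pendulum_reward(math.pi, 8.0, 2.0)  # about -16.2736044
```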
Import the necessary libraries for pre-processing, data exploration, feature engineering and model evaluation. Some libraries used include PyTorch, NumPy, Matplotlib, and Gym.
# Import the necessary modules and libraries
# Gym and Environment Handling
import gym
# Numerical and Visualization Libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib import animation, rc
import seaborn as sns
from torchinfo import summary
# Display and Visualization
from IPython import display as ipythondisplay
from pyvirtualdisplay.display import Display
from IPython.display import clear_output, display
# PyTorch for Neural Networks and Optimization
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torch.distributions as distributions
from torch.distributions import Normal
# Utility and Miscellaneous
import os
import random
import copy
import datetime
from collections import deque, namedtuple
# Hyperparameter tuning
from ray import tune, train
from ray.train import Checkpoint, session
from ray.tune.schedulers import ASHAScheduler
from functools import partial
import tempfile
# Ignore warnings
import warnings
warnings.filterwarnings("ignore")
If torch.cuda.is_available() returns True, it means that the GPU is working as expected for PyTorch.

torch.cuda.is_available()

True

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)

cuda:0
plot_agent_performance is used to plot charts visualizing the changes in reward obtained.

# Function to plot the performance of the model over time
def plot_agent_performance(scores, average_reward, model_name="Random Agent"):
    """
    Plots the performance of an agent.
    Parameters:
        scores (list): A list of scores representing the agent's performance in each episode.
        average_reward (float): The average reward across all episodes.
        model_name (str): The name of the model/agent.
    """
    # Creating subplots: 1 row, 2 columns
    plt.figure(figsize=(15, 6))
    # First subplot: Reward over Episodes
    plt.subplot(1, 2, 1)
    plt.plot(scores, label='Reward per Episode')
    plt.axhline(y=average_reward, color='r', linestyle='-', label='Average Reward')
    plt.xlabel('Episode')
    plt.ylabel('Total Reward')
    plt.title(f'Reward over Episodes for {model_name}')
    plt.legend()
    # Second subplot: Histogram of Rewards
    plt.subplot(1, 2, 2)
    plt.hist(scores, bins=20, alpha=0.7)
    plt.axvline(x=average_reward, color='r', linestyle='-', label='Average Reward')
    plt.xlabel('Total Reward')
    plt.ylabel('Frequency')
    plt.title(f'Distribution of Rewards for {model_name}')
    plt.legend()
    # Display the subplots
    plt.tight_layout()
    plt.show()
# Creating an animation function
def create_animation(frames, filename=None):
    rc("animation", html="jshtml")
    fig = plt.figure()
    plt.axis("off")
    im = plt.imshow(frames[0], animated=True)
    def updatefig(i):
        im.set_array(frames[i])
        return im,
    animationFig = animation.FuncAnimation(fig, updatefig, frames=len(frames), interval=len(frames)/10, blit=True, repeat=False)
    ipythondisplay.display(ipythondisplay.HTML(animationFig.to_html5_video()))
    if filename is not None:
        animationFig.save(filename, writer='imagemagick')
    return animationFig
# Function to test agent weights
def test_agent(agent, type):
    env = gym.make('Pendulum-v1', g=9.81)
    frames = []
    state = env.reset()
    done = False
    cumulative_reward = 0  # Initialize cumulative reward
    while not done:
        if type == 'SAC':
            action, _ = agent.choose_action(torch.FloatTensor(state))
        else:
            action = agent.choose_action(torch.FloatTensor(state))
        state_prime, reward, done, _ = env.step([action])
        cumulative_reward += reward  # Accumulate reward
        state = state_prime
        screen = env.render(mode='rgb_array')
        frames.append(screen)
    env.close()
    print(f'Test reward: {cumulative_reward}')  # Print cumulative reward
    create_animation(frames)
# Initialize the RunningCalc class
class RunningCalc:
    class Node:
        def __init__(self, val):
            self.val = val
            self.next = None
    def __init__(self, limit=10):
        self.head = None
        self.tail = None
        self.count = 0
        self.limit = limit
        self.total = 0
    def add(self, val):
        self.count += 1
        if self.count > self.limit:
            self.total -= self.head.val
            self.head = self.head.next
            self.count -= 1
        if self.head is None and self.tail is None:
            self.head = self.Node(val)
            self.tail = self.head
        else:
            newNode = self.Node(val)
            self.tail.next = newNode
            self.tail = newNode
        self.total += val
    def calc(self):
        return self.total
# Initialize the Tracker class to track rewards over time
class Tracker:
    def __init__(self):
        self.running = {}
        self.reward = {}
        self.success = {}
        self.name = None
    def add(self, name, running, reward, success_rate):
        if name in self.running.keys():
            self.running[name].append(running)
        else:
            self.running[name] = [running]
        if name in self.reward.keys():
            self.reward[name].append(reward)
        else:
            self.reward[name] = [reward]
        if name in self.success.keys():
            self.success[name].append(success_rate)
        else:
            self.success[name] = [success_rate]  # fixed: previously appended the reward here by mistake
        print(f"{name} | Running 200 Reward: {running} | Reward: {reward} | Running Success Rate: {success_rate} ")
    def plot(self, name, metric):
        fig = plt.figure()
        fig.suptitle(f"{name} | {metric}")
        ax = fig.subplots()
        if metric == 'success':
            ax.plot(self.success[name])
        else:
            ax.plot([200 for i in range(len(self.reward[name]))], label='Solve', linestyle='--')
            ax.plot(self.reward[name], label='Reward', color=sns.color_palette('pastel')[0])
            ax.plot(self.running[name], label='Running', color=sns.color_palette('pastel')[1], linestyle='--')
        plt.legend()
    def plot_all(self, metric):
        fig = plt.figure()
        ax = fig.subplots()
        ax.set_xlabel("Episodes (in 20s)")
        if metric == 'success':
            fig.suptitle("All Success")
            score = self.success
            for i, name in enumerate(list(sorted(self.reward.keys()))):
                ax.plot(self.success[name], label=f'{name}', color=sns.color_palette('Paired')[1 + i * 2])
            ax.set_ylabel("Success Rate")
            plt.legend()
        elif metric == 'reward':
            fig.suptitle("All Rewards")
            first = list(self.reward.keys())[0]
            ax.plot([200 for i in range(len(self.reward[first]))], label='Solve', linestyle='--')
            for i, name in enumerate(list(sorted(self.reward.keys()))):
                ax.plot(self.running[name], label=f'{name}', color=sns.color_palette('Paired')[1 + i * 2])
                ax.plot(self.reward[name], color=sns.color_palette('Paired')[0 + i * 2], linestyle='--')
            ax.set_ylabel("Episode Reward")
            plt.legend()

# Change theme of charts
sns.set_theme(style='darkgrid')
# Change font of charts
sns.set(font='Century Gothic')
# Variable for color palettes
color_palette = sns.color_palette('muted')

To visualize what the animation looks like, we will display the environment by running 200 time steps of the pendulum using gym.make("Pendulum-v1").
# Setting up the environment
env = gym.make('Pendulum-v1', g=9.81)
env.action_space.seed(42)
env.reset()
# Defining the frames for 200 time steps
frames = []
for i in range(200):
    action = env.action_space.sample()
    obs, reward, done, info = env.step(action)
    screen = env.render(mode='rgb_array')
    frames.append(screen)
    if done:
        break
env.close()
create_animation(frames)
To implement the pendulum's dynamic equations, we will be utilizing the pendulum's coordinate system as shown below :
x-y : cartesian coordinates of the pendulum's end in meters.
theta : angle in radians.
tau : torque in Nm, defined as positive counter-clockwise.

First, we will conduct some simple exploratory data analysis (EDA) of the pendulum environment, allowing us to better understand the different actions and how they affect the pendulum's movement. Some things we will look at include :
OBSERVATION SPACE ANALYSIS
In obs_low, the values -1, -1, -8 represent the smallest possible values for each of the 3 dimensions (x-coord, y-coord and angular velocity); obs_high holds the corresponding largest values 1, 1, 8.

# Finding the minimum and maximum allowable values for each dimension of observation
obs_low = env.observation_space.low
obs_high = env.observation_space.high
print('Number of Observation Space: ', env.observation_space.shape)
print("Observation Space Low:", obs_low)
print("Observation Space High:", obs_high)

Number of Observation Space:  (3,)
Observation Space Low: [-1. -1. -8.]
Observation Space High: [1. 1. 8.]
ACTION SPACE ANALYSIS
print('Number of Actions: ', env.action_space)

Number of Actions:  Box(-2.0, 2.0, (1,), float32)
TESTING ACTIONS AND ITS EFFECTS ON THE PENDULUM
Now, we will look into how each action affects the pendulum. The pendulum has no discrete actions (its action space is continuous, so the set of possible torques is infinite). Hence, we have selected 5 representative torque patterns to examine more closely in our EDA :
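Since DQN (used later in this project) outputs one Q-value per discrete action, a continuous torque range is typically handled by discretizing it. A minimal sketch, assuming 9 evenly spaced torques over [-2.0, 2.0] (the helper names here are illustrative):

```python
N_ACTIONS = 9
# Evenly spaced torques over [-2.0, 2.0]: -2.0, -1.5, ..., 1.5, 2.0
TORQUES = [-2.0 + 4.0 * i / (N_ACTIONS - 1) for i in range(N_ACTIONS)]

def index_to_torque(action_index):
    """Map a discrete action index (0..8) back to a continuous torque value."""
    return TORQUES[action_index]
```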
ACTION 1 : ZERO TORQUE
# Zero Torque
env = gym.make('Pendulum-v1', g=9.81)
env.action_space.seed(42)
env.reset()
# Defining the frames for 300 time steps
frames = []
for i in range(300):
    obs, reward, done, info = env.step([0.0])
    screen = env.render(mode='rgb_array')
    frames.append(screen)
    if done:
        break
env.close()
create_animation(frames)
ACTION 2 : POSITIVE TORQUE [2.0]
# Positive Torque
env = gym.make('Pendulum-v1', g=9.81)
env.action_space.seed(42)
env.reset()
# Defining the frames for 300 time steps
frames = []
for i in range(300):
    obs, reward, done, info = env.step([2.0])
    screen = env.render(mode='rgb_array')
    frames.append(screen)
    if done:
        break
env.close()
create_animation(frames)
ACTION 3 : NEGATIVE TORQUE [-2.0]
# Negative Torque
env = gym.make('Pendulum-v1', g=9.81)
env.action_space.seed(42)
env.reset()
# Defining the frames for 300 time steps
frames = []
for i in range(300):
    obs, reward, done, info = env.step([-2.0])
    screen = env.render(mode='rgb_array')
    frames.append(screen)
    if done:
        break
env.close()
create_animation(frames)
ACTION 4 : GRADUAL INCREASE IN TORQUE
# Gradual Increase in Torque
env = gym.make('Pendulum-v1', g=9.81)
env.action_space.seed(42)
env.reset()
# Defining the frames for 300 time steps
frames = []
for i in range(300):
    obs, reward, done, info = env.step([-2.0 + (i * 0.08)])
    screen = env.render(mode='rgb_array')
    frames.append(screen)
    if done:
        break
env.close()
create_animation(frames)
ACTION 5 : GRADUAL DECREASE IN TORQUE
# Gradual Decrease in Torque
env = gym.make('Pendulum-v1', g=9.81)
env.action_space.seed(42)
env.reset()
# Defining the frames for 300 time steps
frames = []
for i in range(300):
    obs, reward, done, info = env.step([2.0 - (i * 0.08)])
    screen = env.render(mode='rgb_array')
    frames.append(screen)
    if done:
        break
env.close()
create_animation(frames)
SUMMARY ANALYSIS OF TORQUE MOVEMENTS
We found that applying torque to the pendulum triggers substantial changes in its swinging behavior. High positive torque produces forceful swings away from the natural downward position, decreasing stability and incurring penalties from the reward system. Conversely, negative torque slows movement in the opposite direction, which can aid stability, yet still incurs penalties whenever the pendulum deviates from the desired position.
This could indicate that lower torque may provide higher rewards, as it encourages stability in the pendulum's movement.
Moreover, the reward system penalizes excessive movement, high velocities, and deviations from the desired stable state caused by high torque, resulting in reduced overall rewards. However, gradual changes in torque offer opportunities for systematic exploration, aiding in learning and potentially optimizing strategies for balancing the pendulum while minimizing the penalties incurred in the rewards system.
Having gathered insights from our EDA, we will now proceed to build and test a few reinforcement learning models that balance the pendulum by exerting an appropriate level of torque.
We will be testing with the following models :
In this RL analysis, we will dive deeper into DQN-related architectures than into the other models, to demonstrate their viability in solving the Pendulum task.
The random action model serves as a baseline: it makes decisions solely by random selection from the available action space, without any consideration of the environment's state or any learning strategy.
This model provides a fundamental benchmark against which to evaluate the performance of more advanced models later on, such as the Deep Q-Network.
CREATING AN AGENT THAT TAKES RANDOM ACTIONS
Because a pendulum episode could otherwise go on indefinitely, we will cap each episode at a fixed limit of 200 steps. We will run 800 episodes to give us a benchmark for how well our next few models should perform.
# Create the Gym environment for Pendulum with specified gravity and render mode
env = gym.make('Pendulum-v1', g=9.81)
env.action_space.seed(42)
# Initialize an array to store scores for visualization
total_rewards = []
frames = []
# Define the maximum number of episodes and steps per episode
MAX_EPISODES = 800
MAX_STEP_PER_EPISODE = 200
# Loop through the episodes using a for loop
for i in range(MAX_EPISODES):
    state = env.reset()
    total_reward = 0
    done = False
    start_time = datetime.datetime.now()
    # Loop through the maximum steps per episode
    for step in range(MAX_STEP_PER_EPISODE):
        action = env.action_space.sample()  # Select a random action from the action space
        state, reward, done, info = env.step(action)  # Apply the action and observe the result
        total_reward += reward
        if step % 30 == 0 and total_reward > -50:
            screen = env.render(mode='rgb_array')
            frames.append(screen)
        if done:
            break
    elapsed_time = datetime.datetime.now() - start_time
    if i % 10 == 0:
        print('Episode {:>4} | Total Reward: {:>8.2f} | Elapsed: {}'.format(i, total_reward, elapsed_time))
    total_rewards.append(total_reward)
# Close the environment
env.close()

Episode    0 | Total Reward: -1802.22 | Elapsed: 0:00:00.411252
Episode 10 | Total Reward: -1321.27 | Elapsed: 0:00:00.015025
Episode 20 | Total Reward: -1291.68 | Elapsed: 0:00:00.015692
Episode 30 | Total Reward: -992.26 | Elapsed: 0:00:00.015252
Episode 40 | Total Reward: -1534.34 | Elapsed: 0:00:00.015068
Episode 50 | Total Reward: -1617.21 | Elapsed: 0:00:00.015047
Episode 60 | Total Reward: -1170.23 | Elapsed: 0:00:00.013512
Episode 70 | Total Reward: -1198.94 | Elapsed: 0:00:00.014853
Episode 80 | Total Reward: -1304.78 | Elapsed: 0:00:00.015785
Episode 90 | Total Reward: -903.80 | Elapsed: 0:00:00.014683
Episode 100 | Total Reward: -886.82 | Elapsed: 0:00:00.013750
Episode 110 | Total Reward: -894.23 | Elapsed: 0:00:00.013363
Episode 120 | Total Reward: -755.78 | Elapsed: 0:00:00.019042
Episode 130 | Total Reward: -917.89 | Elapsed: 0:00:00.015532
Episode 140 | Total Reward: -1167.00 | Elapsed: 0:00:00.014530
Episode 150 | Total Reward: -1189.97 | Elapsed: 0:00:00.020365
Episode 160 | Total Reward: -1182.69 | Elapsed: 0:00:00.014773
Episode 170 | Total Reward: -1019.11 | Elapsed: 0:00:00.016037
Episode 180 | Total Reward: -969.14 | Elapsed: 0:00:00.016114
Episode 190 | Total Reward: -1060.26 | Elapsed: 0:00:00.013517
Episode 200 | Total Reward: -900.67 | Elapsed: 0:00:00.018020
Episode 210 | Total Reward: -1054.46 | Elapsed: 0:00:00.015009
Episode 220 | Total Reward: -1071.76 | Elapsed: 0:00:00.016130
Episode 230 | Total Reward: -1291.16 | Elapsed: 0:00:00.016550
Episode 240 | Total Reward: -964.53 | Elapsed: 0:00:00.014381
Episode 250 | Total Reward: -1696.45 | Elapsed: 0:00:00.018044
Episode 260 | Total Reward: -1546.35 | Elapsed: 0:00:00.014513
Episode 270 | Total Reward: -967.59 | Elapsed: 0:00:00.014515
Episode 280 | Total Reward: -1330.98 | Elapsed: 0:00:00.015257
Episode 290 | Total Reward: -1276.31 | Elapsed: 0:00:00.025327
Episode 300 | Total Reward: -1448.81 | Elapsed: 0:00:00.019039
Episode 310 | Total Reward: -969.73 | Elapsed: 0:00:00.016753
Episode 320 | Total Reward: -917.34 | Elapsed: 0:00:00.027612
Episode 330 | Total Reward: -992.74 | Elapsed: 0:00:00.019041
Episode 340 | Total Reward: -997.48 | Elapsed: 0:00:00.015070
Episode 350 | Total Reward: -1359.94 | Elapsed: 0:00:00.015710
Episode 360 | Total Reward: -1217.04 | Elapsed: 0:00:00.015013
Episode 370 | Total Reward: -1333.30 | Elapsed: 0:00:00.017028
Episode 380 | Total Reward: -972.93 | Elapsed: 0:00:00.015113
Episode 390 | Total Reward: -927.15 | Elapsed: 0:00:00.015857
Episode 400 | Total Reward: -1402.74 | Elapsed: 0:00:00.014513
Episode 410 | Total Reward: -866.96 | Elapsed: 0:00:00.016380
Episode 420 | Total Reward: -868.44 | Elapsed: 0:00:00.014042
Episode 430 | Total Reward: -892.04 | Elapsed: 0:00:00.015070
Episode 440 | Total Reward: -1345.45 | Elapsed: 0:00:00.013859
Episode 450 | Total Reward: -1051.27 | Elapsed: 0:00:00.015856
Episode 460 | Total Reward: -1476.64 | Elapsed: 0:00:00.014025
Episode 470 | Total Reward: -1347.09 | Elapsed: 0:00:00.015376
Episode 480 | Total Reward: -1427.48 | Elapsed: 0:00:00.015203
Episode 490 | Total Reward: -1189.14 | Elapsed: 0:00:00.015038
Episode 500 | Total Reward: -1500.24 | Elapsed: 0:00:00.014024
Episode 510 | Total Reward: -1488.33 | Elapsed: 0:00:00.016121
Episode 520 | Total Reward: -939.01 | Elapsed: 0:00:00.014393
Episode 530 | Total Reward: -1673.15 | Elapsed: 0:00:00.014360
Episode 540 | Total Reward: -1288.93 | Elapsed: 0:00:00.015143
Episode 550 | Total Reward: -1458.60 | Elapsed: 0:00:00.015359
Episode 560 | Total Reward: -1403.01 | Elapsed: 0:00:00.014623
Episode 570 | Total Reward: -1292.03 | Elapsed: 0:00:00.015744
Episode 580 | Total Reward: -849.16 | Elapsed: 0:00:00.015180
Episode 590 | Total Reward: -1720.54 | Elapsed: 0:00:00.015425
Episode 600 | Total Reward: -773.16 | Elapsed: 0:00:00.013536
Episode 610 | Total Reward: -766.59 | Elapsed: 0:00:00.014706
Episode 620 | Total Reward: -1544.38 | Elapsed: 0:00:00.015905
Episode 630 | Total Reward: -1449.55 | Elapsed: 0:00:00.014895
Episode 640 | Total Reward: -1339.64 | Elapsed: 0:00:00.015521
Episode 650 | Total Reward: -829.12 | Elapsed: 0:00:00.015473
Episode 660 | Total Reward: -1444.76 | Elapsed: 0:00:00.015420
Episode 670 | Total Reward: -910.04 | Elapsed: 0:00:00.017223
Episode 680 | Total Reward: -753.93 | Elapsed: 0:00:00.014641
Episode 690 | Total Reward: -1520.10 | Elapsed: 0:00:00.015305
Episode 700 | Total Reward: -1487.44 | Elapsed: 0:00:00.015518
Episode 710 | Total Reward: -1651.39 | Elapsed: 0:00:00.014301
Episode 720 | Total Reward: -758.29 | Elapsed: 0:00:00.015094
Episode 730 | Total Reward: -1146.98 | Elapsed: 0:00:00.015406
Episode 740 | Total Reward: -1266.38 | Elapsed: 0:00:00.014977
Episode 750 | Total Reward: -1441.06 | Elapsed: 0:00:00.015604
Episode 760 | Total Reward: -882.69 | Elapsed: 0:00:00.015686
Episode 770 | Total Reward: -1009.75 | Elapsed: 0:00:00.014754
Episode 780 | Total Reward: -912.81 | Elapsed: 0:00:00.015169
Episode 790 | Total Reward: -1045.95 | Elapsed: 0:00:00.014151
VISUALIZING THE PERFORMANCE OF RANDOM AGENT MODEL
# Calculating statistical measures
average_reward = np.mean(total_rewards)
median_reward = np.median(total_rewards)
max_reward = np.max(total_rewards)
min_reward = np.min(total_rewards)
# Identifying the best episode
best_episode_index = np.argmax(total_rewards)
# Neatly formatted output
print("Performance Statistics for the Random Agent:")
print("--------------------------------------------")
print(f"Best Episode : {best_episode_index}")
print(f"Average Reward : {average_reward:.2f}")
print(f"Median Reward : {median_reward:.2f}")
print(f"Maximum Reward : {max_reward:.2f}")
print(f"Minimum Reward : {min_reward:.2f}")
# Plot the charts to show performance over time
plot_agent_performance(total_rewards, average_reward, model_name="Random Agent")

Performance Statistics for the Random Agent:
--------------------------------------------
Best Episode : 61
Average Reward : -1219.34
Median Reward : -1179.79
Maximum Reward : -728.41
Minimum Reward : -1830.10

VISUALIZING THE PENDULUM ANIMATION FOR THE RANDOM ACTION MODEL
create_animation(frames)
DQN (Deep Q-Network) is a reinforcement learning algorithm that combines Q-Learning with deep neural networks to estimate the Q-value function. The goal of DQN is to find a policy that maximizes the expected cumulative reward in an environment, by using the neural network to approximate the Q-value for each possible action in a given state. This allows DQN to scale to high-dimensional state spaces and solve more complex problems than traditional Q-Learning methods.
In reinforcement learning, the Q-value function represents the expected cumulative reward from taking a certain action in a certain state and following a specific policy thereafter. DQN uses a neural network to approximate the Q-value function and make decisions about which action to take in each state. The network is trained on a dataset of state-action-reward transitions generated by interacting with the environment. The training process updates the network weights so that the estimated Q-values for each action become more accurate over time.
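The update rule behind this training process is the Q-learning target: for a transition (s, a, r, s'), Q(s, a) is nudged toward r + gamma * max_a' Q(s', a'). A tabular, dictionary-backed sketch (the function name and learning rate alpha are illustrative assumptions; DQN replaces the table with a neural network):

```python
def q_update(Q, s, a, r, s_prime, actions, alpha=0.1, gamma=0.98):
    """One Q-learning step: move Q[(s, a)] toward r + gamma * max_a' Q[(s', a')].
    Q is a dict mapping (state, action) pairs to estimated values."""
    best_next = max(Q.get((s_prime, a2), 0.0) for a2 in actions)
    target = r + gamma * best_next
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
    return Q
```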
One key innovation of DQN is the use of experience replay, which is a technique for storing and reusing previously observed state-action-reward transitions to decorrelate the samples and improve the stability of the learning process. Another important aspect of DQN is the use of target networks, which are separate networks that are used to stabilize the training of the primary network. The target network's weights are updated less frequently than the primary network's weights, which helps prevent overfitting and stabilize the learning process.
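The "soft" variant of the target-network update blends the target weights toward the primary network's weights by a small factor tau each step: theta_target <- tau * theta + (1 - tau) * theta_target. A plain-Python sketch over flat weight lists (illustrative only; in PyTorch the same blend is applied parameter-wise, as the training loop later in this notebook does):

```python
def soft_update(target_weights, source_weights, tau=0.01):
    """Polyak averaging: new_target = (1 - tau) * target + tau * source."""
    return [(1.0 - tau) * t + tau * s
            for t, s in zip(target_weights, source_weights)]
```

With a small tau, the target network trails the primary network slowly, which keeps the bootstrapped targets stable.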
INITIALIZING AND CREATING THE REPLAYBUFFER CLASS
The ReplayBuffer class serves as a memory storage system in RL tasks.

class ReplayBuffer:
    def __init__(self, buffer_limit):
        self.buffer = deque(maxlen=buffer_limit)
    def put(self, transition):
        self.buffer.append(transition)
    def sample(self, n):
        mini_batch = random.sample(self.buffer, n)
        s_lst, a_lst, r_lst, s_prime_lst, done_mask_lst = [], [], [], [], []
        for transition in mini_batch:
            s, a, r, s_prime, done = transition
            s_lst.append(s)
            a_lst.append([a])
            r_lst.append([r])
            s_prime_lst.append(s_prime)
            done_mask = 0.0 if done else 1.0
            done_mask_lst.append([done_mask])
        s_batch = torch.tensor(s_lst, dtype=torch.float)
        a_batch = torch.tensor(a_lst, dtype=torch.float)
        r_batch = torch.tensor(r_lst, dtype=torch.float)
        s_prime_batch = torch.tensor(s_prime_lst, dtype=torch.float)
        done_batch = torch.tensor(done_mask_lst, dtype=torch.float)
        return s_batch, a_batch, r_batch, s_prime_batch, done_batch
    def size(self):
        return len(self.buffer)

SETTING UP THE MODEL ARCHITECTURE FOR THE SIMPLE DQN MODEL
# Defining the QNetwork class for the DQN Agent
class QNetwork(nn.Module):
    def __init__(self, state_dim, action_dim, q_lr):
        super(QNetwork, self).__init__()
        self.fc_1 = nn.Linear(state_dim, 64)
        self.fc_2 = nn.Linear(64, 32)
        self.fc_out = nn.Linear(32, action_dim)
        self.lr = q_lr
        self.optimizer = optim.Adam(self.parameters(), lr=self.lr)
    def forward(self, x):
        q = F.leaky_relu(self.fc_1(x))
        q = F.leaky_relu(self.fc_2(q))
        q = self.fc_out(q)
        return q
# Creating a class for the DQN Agent
class DQNAgent:
    def __init__(self):
        self.state_dim = 3
        self.action_dim = 9
        self.lr = 0.01
        self.gamma = 0.98
        self.tau = 0.01
        self.epsilon = 1.0
        self.epsilon_decay = 0.98
        self.epsilon_min = 0.001
        self.buffer_size = 100000
        self.batch_size = 200
        self.memory = ReplayBuffer(self.buffer_size)
        self.Q = QNetwork(self.state_dim, self.action_dim, self.lr)
        self.Q_target = QNetwork(self.state_dim, self.action_dim, self.lr)
        self.Q_target.load_state_dict(self.Q.state_dict())
    def choose_action(self, state):
        random_number = np.random.rand()
        maxQ_action_count = 0
        if self.epsilon < random_number:
            # Exploit: pick the action with the highest Q-value
            with torch.no_grad():
                action = float(torch.argmax(self.Q(state)).numpy())
            real_action = (action - 4) / 2  # map index 0..8 to torque -2.0..2.0
            maxQ_action_count = 1
        else:
            # Explore: pick a random discrete action
            action = np.random.choice([n for n in range(9)])
            real_action = (action - 4) / 2  # same mapping as the greedy branch
        return action, real_action, maxQ_action_count
    def calc_target(self, mini_batch):
        s, a, r, s_prime, done = mini_batch
        with torch.no_grad():
            q_target = self.Q_target(s_prime).max(1)[0].unsqueeze(1)
            target = r + self.gamma * done * q_target
        return target
    def train_agent(self):
        mini_batch = self.memory.sample(self.batch_size)
        s_batch, a_batch, r_batch, s_prime_batch, done_batch = mini_batch
        a_batch = a_batch.type(torch.int64)
        td_target = self.calc_target(mini_batch)
        # QNetwork training
        Q_a = self.Q(s_batch).gather(1, a_batch)
        q_loss = F.smooth_l1_loss(Q_a, td_target)
        self.Q.optimizer.zero_grad()
        q_loss.mean().backward()
        self.Q.optimizer.step()
        # QNetwork Soft Update
        for param_target, param in zip(self.Q_target.parameters(), self.Q.parameters()):
            param_target.data.copy_(param_target.data * (1.0 - self.tau) + param.data * self.tau)

def train_DQNAgent():
    # Initialize the DQN Agent and related variables required
    agent = DQNAgent()
    env = gym.make('Pendulum-v1', g=9.81)
    episodes = 800
    total_rewards = []
    frames = []
    no_of_steps = []
    success_count = 0
    best_episode = 0
    best_reward = float('-inf')
    # Loop through the range of episodes
    for episode in range(episodes):
        state = env.reset()
        score, done = 0.0, False
        maxQ_action_count = 0
        start_time = datetime.datetime.now()
        while not done:
            action, real_action, count = agent.choose_action(torch.FloatTensor(state))
            state_prime, reward, done, _ = env.step([real_action])
            agent.memory.put((state, action, reward, state_prime, done))
            score += reward
            maxQ_action_count += count
            state = state_prime
            if maxQ_action_count % 100 == 0 and score > -50:
                screen = env.render(mode='rgb_array')
                frames.append(screen)
            if agent.memory.size() > 1000:
                agent.train_agent()
        # Recording results
        if len(total_rewards) > 0:
            success_count += (score - total_rewards[-1]) >= 200
        total_rewards.append(score)
        no_of_steps.append(maxQ_action_count)
        if score > best_reward:
            best_reward = score
            best_episode = episode
        # Saving the Models
        save_folder = "DQN"
        if not os.path.exists(save_folder):
            os.makedirs(save_folder)
        if episode == best_episode:
            model_Q = os.path.join(save_folder, "DQN" + str(episode) + ".pt")
            torch.save(agent.Q.state_dict(), model_Q)
        if episode % 10 == 0:
            elapsed_time = datetime.datetime.now() - start_time
            print('Episode {:>4} | Total Reward: {:>8.2f} | MaxQ_Action_Count:{:>5} | Epsilon: {:>4.4f} | Elapsed: {}'.format(episode, score, maxQ_action_count, agent.epsilon, elapsed_time))
        if agent.epsilon > agent.epsilon_min:
            agent.epsilon *= agent.epsilon_decay
    env.close()
    return {
        'total_rewards': total_rewards,
        'no_of_steps': no_of_steps,
        'success_count': success_count,
        'frames': frames
    }
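The soft update loop at the end of `train_agent` is Polyak averaging with `tau = 0.01`: each call nudges every target parameter one percent of the way toward its online counterpart, so the TD targets move slowly. A standalone numpy sketch of the resulting convergence (illustrative only, not the notebook's torch code):

```python
import numpy as np

TAU = 0.01  # same tau as the agent above

def soft_update(target, online, tau=TAU):
    # One Polyak step: target <- (1 - tau) * target + tau * online
    return (1.0 - tau) * target + tau * online

target, online = np.array([0.0]), np.array([1.0])
for _ in range(100):
    target = soft_update(target, online)
# After k steps the remaining gap shrinks by (1 - tau)^k,
# so here target = 1 - 0.99**100, roughly 0.634
```

With `tau = 1.0` this degenerates into the hard target-network copy of the original DQN paper; small `tau` trades update speed for stability.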
DQN_results = train_DQNAgent()
Episode 0 | Total Reward: -1442.58 | MaxQ_Action_Count: 0 | Epsilon: 1.0000 | Elapsed: 0:00:00.581677
Episode 10 | Total Reward: -875.69 | MaxQ_Action_Count: 34 | Epsilon: 0.8171 | Elapsed: 0:00:00.584059
Episode 20 | Total Reward: -894.80 | MaxQ_Action_Count: 66 | Epsilon: 0.6676 | Elapsed: 0:00:00.685368
Episode 30 | Total Reward: -889.94 | MaxQ_Action_Count: 91 | Epsilon: 0.5455 | Elapsed: 0:00:00.568242
Episode 40 | Total Reward: -379.12 | MaxQ_Action_Count: 124 | Epsilon: 0.4457 | Elapsed: 0:00:00.567993
Episode 50 | Total Reward: -490.62 | MaxQ_Action_Count: 132 | Epsilon: 0.3642 | Elapsed: 0:00:00.650391
Episode 60 | Total Reward: -376.69 | MaxQ_Action_Count: 143 | Epsilon: 0.2976 | Elapsed: 0:00:00.580481
Episode 70 | Total Reward: -373.07 | MaxQ_Action_Count: 154 | Epsilon: 0.2431 | Elapsed: 0:00:00.584176
Episode 80 | Total Reward: -124.28 | MaxQ_Action_Count: 162 | Epsilon: 0.1986 | Elapsed: 0:00:00.567040
Episode 90 | Total Reward: -892.51 | MaxQ_Action_Count: 163 | Epsilon: 0.1623 | Elapsed: 0:00:00.585132
Episode 100 | Total Reward: -365.75 | MaxQ_Action_Count: 172 | Epsilon: 0.1326 | Elapsed: 0:00:00.581452
Episode 110 | Total Reward: -124.99 | MaxQ_Action_Count: 186 | Epsilon: 0.1084 | Elapsed: 0:00:00.577302
Episode 120 | Total Reward: -251.45 | MaxQ_Action_Count: 189 | Epsilon: 0.0885 | Elapsed: 0:00:00.607042
Episode 130 | Total Reward: -615.79 | MaxQ_Action_Count: 186 | Epsilon: 0.0723 | Elapsed: 0:00:00.617960
Episode 140 | Total Reward: -252.02 | MaxQ_Action_Count: 190 | Epsilon: 0.0591 | Elapsed: 0:00:00.432549
Episode 150 | Total Reward: -245.99 | MaxQ_Action_Count: 192 | Epsilon: 0.0483 | Elapsed: 0:00:00.503967
Episode 160 | Total Reward: -124.51 | MaxQ_Action_Count: 191 | Epsilon: 0.0395 | Elapsed: 0:00:00.372844
Episode 170 | Total Reward: -122.16 | MaxQ_Action_Count: 193 | Epsilon: 0.0322 | Elapsed: 0:00:00.554304
Episode 180 | Total Reward: -238.61 | MaxQ_Action_Count: 196 | Epsilon: 0.0263 | Elapsed: 0:00:00.478303
Episode 190 | Total Reward: -492.15 | MaxQ_Action_Count: 197 | Epsilon: 0.0215 | Elapsed: 0:00:00.633898
Episode 200 | Total Reward: -124.90 | MaxQ_Action_Count: 199 | Epsilon: 0.0176 | Elapsed: 0:00:00.591285
Episode 210 | Total Reward: -244.92 | MaxQ_Action_Count: 197 | Epsilon: 0.0144 | Elapsed: 0:00:00.589859
Episode 220 | Total Reward: -1.64 | MaxQ_Action_Count: 200 | Epsilon: 0.0117 | Elapsed: 0:00:00.633756
Episode 230 | Total Reward: -357.21 | MaxQ_Action_Count: 198 | Epsilon: 0.0096 | Elapsed: 0:00:00.639592
Episode 240 | Total Reward: -1.74 | MaxQ_Action_Count: 200 | Epsilon: 0.0078 | Elapsed: 0:00:00.656186
Episode 250 | Total Reward: -245.18 | MaxQ_Action_Count: 199 | Epsilon: 0.0064 | Elapsed: 0:00:00.657843
Episode 260 | Total Reward: -236.65 | MaxQ_Action_Count: 199 | Epsilon: 0.0052 | Elapsed: 0:00:00.621576
Episode 270 | Total Reward: -367.15 | MaxQ_Action_Count: 200 | Epsilon: 0.0043 | Elapsed: 0:00:00.773133
Episode 280 | Total Reward: -237.50 | MaxQ_Action_Count: 198 | Epsilon: 0.0035 | Elapsed: 0:00:00.638295
Episode 290 | Total Reward: -2.32 | MaxQ_Action_Count: 200 | Epsilon: 0.0029 | Elapsed: 0:00:00.645979
Episode 300 | Total Reward: -729.79 | MaxQ_Action_Count: 200 | Epsilon: 0.0023 | Elapsed: 0:00:00.617639
Episode 310 | Total Reward: -754.04 | MaxQ_Action_Count: 199 | Epsilon: 0.0019 | Elapsed: 0:00:00.617213
Episode 320 | Total Reward: -608.00 | MaxQ_Action_Count: 200 | Epsilon: 0.0016 | Elapsed: 0:00:00.587593
Episode 330 | Total Reward: -127.17 | MaxQ_Action_Count: 200 | Epsilon: 0.0013 | Elapsed: 0:00:00.607517
Episode 340 | Total Reward: -238.76 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.683272
Episode 350 | Total Reward: -1.74 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.693084
Episode 360 | Total Reward: -2.97 | MaxQ_Action_Count: 199 | Epsilon: 0.0010 | Elapsed: 0:00:00.675632
Episode 370 | Total Reward: -247.95 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.658638
Episode 380 | Total Reward: -121.05 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.615279
Episode 390 | Total Reward: -369.79 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.940296
Episode 400 | Total Reward: -674.26 | MaxQ_Action_Count: 199 | Epsilon: 0.0010 | Elapsed: 0:00:00.812862
Episode 410 | Total Reward: -122.86 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.834430
Episode 420 | Total Reward: -126.15 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.763558
Episode 430 | Total Reward: -125.37 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.719899
Episode 440 | Total Reward: -123.28 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.722948
Episode 450 | Total Reward: -366.78 | MaxQ_Action_Count: 199 | Epsilon: 0.0010 | Elapsed: 0:00:00.698313
Episode 460 | Total Reward: -2.89 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.606581
Episode 470 | Total Reward: -2.53 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.628375
Episode 480 | Total Reward: -245.80 | MaxQ_Action_Count: 199 | Epsilon: 0.0010 | Elapsed: 0:00:00.615340
Episode 490 | Total Reward: -126.18 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.612532
Episode 500 | Total Reward: -126.28 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.616444
Episode 510 | Total Reward: -127.19 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.590390
Episode 520 | Total Reward: -125.42 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.782460
Episode 530 | Total Reward: -374.73 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.579272
Episode 540 | Total Reward: -484.54 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.592744
Episode 550 | Total Reward: -125.92 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.721086
Episode 560 | Total Reward: -124.02 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.684132
Episode 570 | Total Reward: -354.03 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.993130
Episode 580 | Total Reward: -366.26 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:01.313442
Episode 590 | Total Reward: -122.87 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.807287
Episode 600 | Total Reward: -123.00 | MaxQ_Action_Count: 199 | Epsilon: 0.0010 | Elapsed: 0:00:00.886123
Episode 610 | Total Reward: -128.40 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.903556
Episode 620 | Total Reward: -129.08 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.935696
Episode 630 | Total Reward: -485.23 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.846403
Episode 640 | Total Reward: -127.66 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.915433
Episode 650 | Total Reward: -629.76 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.798961
Episode 660 | Total Reward: -362.97 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.805026
Episode 670 | Total Reward: -369.98 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.960277
Episode 680 | Total Reward: -3.53 | MaxQ_Action_Count: 199 | Epsilon: 0.0010 | Elapsed: 0:00:00.938893
Episode 690 | Total Reward: -364.13 | MaxQ_Action_Count: 199 | Epsilon: 0.0010 | Elapsed: 0:00:00.797671
Episode 700 | Total Reward: -126.48 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.818296
Episode 710 | Total Reward: -734.94 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.822243
Episode 720 | Total Reward: -371.50 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.789885
Episode 730 | Total Reward: -486.12 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.822754
Episode 740 | Total Reward: -485.64 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.772636
Episode 750 | Total Reward: -619.11 | MaxQ_Action_Count: 199 | Epsilon: 0.0010 | Elapsed: 0:00:00.801603
Episode 760 | Total Reward: -380.69 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.843347
Episode 770 | Total Reward: -366.47 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.802733
Episode 780 | Total Reward: -255.12 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.889237
Episode 790 | Total Reward: -579.27 | MaxQ_Action_Count: 199 | Epsilon: 0.0010 | Elapsed: 0:00:00.818417
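The Epsilon column above flattens at 0.0010 in the mid-300s, which follows directly from the multiplicative decay schedule (`epsilon_decay = 0.98` per episode, floored at `epsilon_min = 0.001`):

```python
# Count how many episodes of 2% decay it takes for epsilon
# to fall from 1.0 to the floor of 0.001
eps, eps_min, decay = 1.0, 0.001, 0.98
episodes = 0
while eps > eps_min:
    eps *= decay
    episodes += 1
print(episodes)  # 342: roughly where the log's epsilon column bottoms out
```

After that point the agent is almost purely greedy, which is why MaxQ_Action_Count sits at (or near) the 200-step episode length for the rest of training.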
VISUALIZING THE PERFORMANCE OF THE SIMPLE DQN MODEL
HOW WILL WE IMPROVE THIS MODEL'S PERFORMANCE?
# Calculating statistical measures
average_reward = np.mean(DQN_results['total_rewards'])
median_reward = np.median(DQN_results['total_rewards'])
max_reward = np.max(DQN_results['total_rewards'])
min_reward = np.min(DQN_results['total_rewards'])
# Identifying the best episode
best_episode_index = np.argmax(DQN_results['total_rewards'])
# Printing the Statistics
print("Performance Statistics for the Simple DQN Model:")
print("--------------------------------------------")
print(f"Best Episode : {best_episode_index}")
print(f"Average Reward : {average_reward:.2f}")
print(f"Median Reward : {median_reward:.2f}")
print(f"Maximum Reward : {max_reward:.2f}")
print(f"Minimum Reward : {min_reward:.2f}")
# Plot the charts to show performance over time
plot_agent_performance(DQN_results['total_rewards'], average_reward, model_name="Simple DQN")
Performance Statistics for the Simple DQN Model:
--------------------------------------------
Best Episode : 138
Average Reward : -347.25
Median Reward : -252.25
Maximum Reward : -1.52
Minimum Reward : -1775.87

VIEWING THE MODEL ARCHITECTURE AND PENDULUM ANIMATION
We load the best episode's saved weights and switch the network to inference mode with PyTorch's .eval() function.
# Load and view the model's architecture used for DQN
trained_model = DQNAgent()
trained_model.Q.load_state_dict(torch.load("DQN/DQN138.pt"))
trained_model.Q.eval()
QNetwork(
(fc_1): Linear(in_features=3, out_features=64, bias=True)
(fc_2): Linear(in_features=64, out_features=32, bias=True)
(fc_out): Linear(in_features=32, out_features=9, bias=True)
)
TESTING OUR MODEL WEIGHTS
class DQNTestAgent:
    def __init__(self, weight_file_path):
        self.state_dim = 3
        self.action_dim = 9
        self.lr = 0.01
        self.trained_model = weight_file_path
        self.Q = QNetwork(self.state_dim, self.action_dim, self.lr)
        self.Q.load_state_dict(torch.load(self.trained_model))

    def choose_action(self, state):
        with torch.no_grad():
            action = float(torch.argmax(self.Q(state)).numpy())
        # Map the discrete index 0-8 onto the torque range [-2, 2]
        real_action = (action - 4) / 2
        return real_action
agent = DQNTestAgent('DQN/DQN138.pt')
test_agent(agent, 'Simple DQN')
Test reward: -127.04517067648156
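The index-to-torque mapping inside `choose_action` generalizes to any odd number of bins. A minimal sketch (the helper name `discrete_to_torque` is ours, not from the notebook):

```python
def discrete_to_torque(index, n_actions=9, max_torque=2.0):
    # Spread n_actions indices evenly over [-max_torque, max_torque];
    # for n_actions=9 this reduces to (index - 4) / 2, as used above
    mid = (n_actions - 1) / 2
    return (index - mid) * max_torque / mid

print(discrete_to_torque(0))   # -2.0 (full clockwise torque)
print(discrete_to_torque(4))   #  0.0 (no torque)
print(discrete_to_torque(8))   #  2.0 (full counter-clockwise torque)
```

With `n_actions=15` the same formula gives `(index - 7) / 3.5`, which is the mapping a 15-way discretization of Pendulum-v1's [-2, 2] action space needs.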

MODEL TRAINING EVOLUTION
# Visualizing the pendulum's animation
create_animation(DQN_results['frames'])
We will mainly be exploring the following changes:
- Adding a third hidden layer (fc_3) to the Q-network.
- Reducing the learning rate from 0.01 to 0.001.
- Raising the initial epsilon from 1.0 to 1.5 so the agent explores for longer before acting greedily.
- Increasing action_dim from 9 to 15 (increasing the number of discretized actions the pendulum can perform).
# Defining the ImprovedQNetwork class for the Improved DQN Agent
class ImprovedQNetwork(nn.Module):
    def __init__(self, state_dim, action_dim, q_lr):
        super(ImprovedQNetwork, self).__init__()
        self.fc_1 = nn.Linear(state_dim, 64)
        self.fc_2 = nn.Linear(64, 32)
        self.fc_3 = nn.Linear(32, 16)  # Added another layer to the network
        self.fc_out = nn.Linear(16, action_dim)
        self.lr = q_lr
        self.optimizer = optim.Adam(self.parameters(), lr=self.lr)

    def forward(self, x):
        q = F.leaky_relu(self.fc_1(x))
        q = F.leaky_relu(self.fc_2(q))
        q = F.leaky_relu(self.fc_3(q))
        q = self.fc_out(q)
        return q
# Creating a class for the Improved DQN Agent
class ImprovedDQNAgent:
    def __init__(self):
        self.state_dim = 3
        self.action_dim = 15  # Increased discretization of the action space
        self.lr = 0.001  # Reduced learning rate from 0.01 to 0.001
        self.gamma = 0.98
        self.tau = 0.01
        self.epsilon = 1.5  # Raised initial epsilon from 1.0 to 1.5 for longer exploration
        self.epsilon_decay = 0.98
        self.epsilon_min = 0.001
        self.buffer_size = 100000
        self.batch_size = 200
        self.memory = ReplayBuffer(self.buffer_size)
        self.Q = ImprovedQNetwork(self.state_dim, self.action_dim, self.lr)
        self.Q_target = ImprovedQNetwork(self.state_dim, self.action_dim, self.lr)
        self.Q_target.load_state_dict(self.Q.state_dict())

    def choose_action(self, state):
        random_number = np.random.rand()
        maxQ_action_count = 0
        if self.epsilon < random_number:
            with torch.no_grad():
                action = float(torch.argmax(self.Q(state)).numpy())
            # Map the discrete index 0-14 onto the torque range [-2, 2]
            # (the original code still used the 9-action mapping here)
            real_action = (action - 7) / 3.5
            maxQ_action_count = 1
        else:
            # Sample from all 15 actions, not the original range(9)
            action = np.random.choice(self.action_dim)
            real_action = (action - 7) / 3.5
        return action, real_action, maxQ_action_count

    def calc_target(self, mini_batch):
        s, a, r, s_prime, done = mini_batch
        with torch.no_grad():
            q_target = self.Q_target(s_prime).max(1)[0].unsqueeze(1)
            # done is stored as a mask (0 at terminal states)
            target = r + self.gamma * done * q_target
        return target

    def train_agent(self):
        mini_batch = self.memory.sample(self.batch_size)
        s_batch, a_batch, r_batch, s_prime_batch, done_batch = mini_batch
        a_batch = a_batch.type(torch.int64)
        td_target = self.calc_target(mini_batch)
        # QNetwork training
        Q_a = self.Q(s_batch).gather(1, a_batch)
        q_loss = F.smooth_l1_loss(Q_a, td_target)
        self.Q.optimizer.zero_grad()
        q_loss.mean().backward()
        self.Q.optimizer.step()
        # QNetwork Soft Update
        for param_target, param in zip(self.Q_target.parameters(), self.Q.parameters()):
            param_target.data.copy_(param_target.data * (1.0 - self.tau) + param.data * self.tau)

def train_ImprovedDQNAgent():
    # Initialize the Improved DQN Agent and related variables required
    agent = ImprovedDQNAgent()
    env = gym.make('Pendulum-v1', g=9.81)
    episodes = 800
    total_rewards = []
    frames = []
    no_of_steps = []
    success_count = 0
    best_episode = 0
    best_reward = float('-inf')
    # Loop through the range of episodes
    for episode in range(episodes):
        state = env.reset()
        score, done = 0.0, False
        maxQ_action_count = 0
        start_time = datetime.datetime.now()
        while not done:
            action, real_action, count = agent.choose_action(torch.FloatTensor(state))
            state_prime, reward, done, _ = env.step([real_action])
            agent.memory.put((state, action, reward, state_prime, done))
            score += reward
            maxQ_action_count += count
            state = state_prime
            if maxQ_action_count % 100 == 0 and score > -50:
                screen = env.render(mode='rgb_array')
                frames.append(screen)
            if agent.memory.size() > 1000:
                agent.train_agent()
        # Recording results
        if len(total_rewards) > 0:
            success_count += (score - total_rewards[-1]) >= 200
        total_rewards.append(score)
        no_of_steps.append(maxQ_action_count)
        if score > best_reward:
            best_reward = score
            best_episode = episode
        # Saving the Models
        save_folder = "IMPROVED DQN"
        if not os.path.exists(save_folder):
            os.makedirs(save_folder)
        if episode == best_episode:
            model_name = os.path.join(save_folder, "IMPROVED_DQN" + str(episode) + ".pt")
            torch.save(agent.Q.state_dict(), model_name)
        if episode % 10 == 0:
            elapsed_time = datetime.datetime.now() - start_time
            print('Episode {:>4} | Total Reward: {:>8.2f} | MaxQ_Action_Count:{:>5} | Epsilon: {:>4.4f} | Elapsed: {}'.format(episode, score, maxQ_action_count, agent.epsilon, elapsed_time))
        if agent.epsilon > agent.epsilon_min:
            agent.epsilon *= agent.epsilon_decay
    env.close()
    return {
        'total_rewards': total_rewards,
        'no_of_steps': no_of_steps,
        'success_count': success_count,
        'frames': frames
    }
ImprovedDQN_results = train_ImprovedDQNAgent()
Episode 0 | Total Reward: -1701.66 | MaxQ_Action_Count: 0 | Epsilon: 1.5000 | Elapsed: 0:00:00.465424
Episode 10 | Total Reward: -1765.65 | MaxQ_Action_Count: 0 | Epsilon: 1.2256 | Elapsed: 0:00:00.709585
Episode 20 | Total Reward: -1346.81 | MaxQ_Action_Count: 0 | Epsilon: 1.0014 | Elapsed: 0:00:00.650730
Episode 30 | Total Reward: -1693.30 | MaxQ_Action_Count: 31 | Epsilon: 0.8182 | Elapsed: 0:00:00.660949
Episode 40 | Total Reward: -647.31 | MaxQ_Action_Count: 64 | Epsilon: 0.6686 | Elapsed: 0:00:00.625166
Episode 50 | Total Reward: -760.08 | MaxQ_Action_Count: 79 | Epsilon: 0.5463 | Elapsed: 0:00:00.714982
Episode 60 | Total Reward: -908.27 | MaxQ_Action_Count: 119 | Epsilon: 0.4463 | Elapsed: 0:00:00.623377
Episode 70 | Total Reward: -253.43 | MaxQ_Action_Count: 136 | Epsilon: 0.3647 | Elapsed: 0:00:00.690183
Episode 80 | Total Reward: -842.77 | MaxQ_Action_Count: 124 | Epsilon: 0.2980 | Elapsed: 0:00:00.662831
Episode 90 | Total Reward: -362.60 | MaxQ_Action_Count: 148 | Epsilon: 0.2435 | Elapsed: 0:00:00.648018
Episode 100 | Total Reward: -242.11 | MaxQ_Action_Count: 162 | Epsilon: 0.1989 | Elapsed: 0:00:00.686762
Episode 110 | Total Reward: -246.22 | MaxQ_Action_Count: 172 | Epsilon: 0.1625 | Elapsed: 0:00:00.706575
Episode 120 | Total Reward: -253.04 | MaxQ_Action_Count: 179 | Epsilon: 0.1328 | Elapsed: 0:00:00.731047
Episode 130 | Total Reward: -125.22 | MaxQ_Action_Count: 176 | Epsilon: 0.1085 | Elapsed: 0:00:00.655017
Episode 140 | Total Reward: -248.83 | MaxQ_Action_Count: 172 | Epsilon: 0.0887 | Elapsed: 0:00:00.625905
Episode 150 | Total Reward: -355.15 | MaxQ_Action_Count: 190 | Epsilon: 0.0724 | Elapsed: 0:00:00.651852
Episode 160 | Total Reward: -122.48 | MaxQ_Action_Count: 188 | Epsilon: 0.0592 | Elapsed: 0:00:00.685368
Episode 170 | Total Reward: -124.56 | MaxQ_Action_Count: 190 | Epsilon: 0.0484 | Elapsed: 0:00:00.661211
Episode 180 | Total Reward: -120.08 | MaxQ_Action_Count: 196 | Epsilon: 0.0395 | Elapsed: 0:00:00.744368
Episode 190 | Total Reward: -445.75 | MaxQ_Action_Count: 194 | Epsilon: 0.0323 | Elapsed: 0:00:00.734945
Episode 200 | Total Reward: -235.56 | MaxQ_Action_Count: 187 | Epsilon: 0.0264 | Elapsed: 0:00:00.720266
Episode 210 | Total Reward: -120.48 | MaxQ_Action_Count: 197 | Epsilon: 0.0216 | Elapsed: 0:00:00.678907
Episode 220 | Total Reward: -0.84 | MaxQ_Action_Count: 196 | Epsilon: 0.0176 | Elapsed: 0:00:00.691770
Episode 230 | Total Reward: -231.33 | MaxQ_Action_Count: 196 | Epsilon: 0.0144 | Elapsed: 0:00:00.729621
Episode 240 | Total Reward: -124.47 | MaxQ_Action_Count: 199 | Epsilon: 0.0118 | Elapsed: 0:00:00.720605
Episode 250 | Total Reward: -368.18 | MaxQ_Action_Count: 199 | Epsilon: 0.0096 | Elapsed: 0:00:00.706631
Episode 260 | Total Reward: -252.77 | MaxQ_Action_Count: 198 | Epsilon: 0.0079 | Elapsed: 0:00:00.736654
Episode 270 | Total Reward: -122.60 | MaxQ_Action_Count: 199 | Epsilon: 0.0064 | Elapsed: 0:00:00.672207
Episode 280 | Total Reward: -239.90 | MaxQ_Action_Count: 200 | Epsilon: 0.0052 | Elapsed: 0:00:00.714770
Episode 290 | Total Reward: -126.78 | MaxQ_Action_Count: 200 | Epsilon: 0.0043 | Elapsed: 0:00:00.733234
Episode 300 | Total Reward: -123.97 | MaxQ_Action_Count: 200 | Epsilon: 0.0035 | Elapsed: 0:00:00.712044
Episode 310 | Total Reward: -246.68 | MaxQ_Action_Count: 198 | Epsilon: 0.0029 | Elapsed: 0:00:00.663007
Episode 320 | Total Reward: -124.61 | MaxQ_Action_Count: 200 | Epsilon: 0.0023 | Elapsed: 0:00:00.702435
Episode 330 | Total Reward: -119.87 | MaxQ_Action_Count: 200 | Epsilon: 0.0019 | Elapsed: 0:00:00.719391
Episode 340 | Total Reward: -364.39 | MaxQ_Action_Count: 198 | Epsilon: 0.0016 | Elapsed: 0:00:00.763188
Episode 350 | Total Reward: -390.23 | MaxQ_Action_Count: 200 | Epsilon: 0.0013 | Elapsed: 0:00:00.752954
Episode 360 | Total Reward: -243.66 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.721182
Episode 370 | Total Reward: -126.84 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.774583
Episode 380 | Total Reward: -126.41 | MaxQ_Action_Count: 199 | Epsilon: 0.0010 | Elapsed: 0:00:00.778766
Episode 390 | Total Reward: -127.95 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.685989
Episode 400 | Total Reward: -125.55 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.897287
Episode 410 | Total Reward: -122.37 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.618388
Episode 420 | Total Reward: -117.74 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.788237
Episode 430 | Total Reward: -124.20 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.736342
Episode 440 | Total Reward: -125.99 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.765104
Episode 450 | Total Reward: -125.99 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.812960
Episode 460 | Total Reward: -355.16 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.737551
Episode 470 | Total Reward: -128.87 | MaxQ_Action_Count: 199 | Epsilon: 0.0010 | Elapsed: 0:00:00.736732
Episode 480 | Total Reward: -120.41 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:01.098357
Episode 490 | Total Reward: -122.67 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.667135
Episode 500 | Total Reward: -415.92 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.663800
Episode 510 | Total Reward: -366.80 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.658987
Episode 520 | Total Reward: -124.80 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.668457
Episode 530 | Total Reward: -233.67 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.628503
Episode 540 | Total Reward: -237.89 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.694867
Episode 550 | Total Reward: -333.84 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.673737
Episode 560 | Total Reward: -3.03 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.660887
Episode 570 | Total Reward: -274.22 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.638734
Episode 580 | Total Reward: -360.31 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.635958
Episode 590 | Total Reward: -240.42 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.629189
Episode 600 | Total Reward: -123.62 | MaxQ_Action_Count: 199 | Epsilon: 0.0010 | Elapsed: 0:00:00.666066
Episode 610 | Total Reward: -127.02 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.653596
Episode 620 | Total Reward: -230.49 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.661901
Episode 630 | Total Reward: -126.38 | MaxQ_Action_Count: 199 | Epsilon: 0.0010 | Elapsed: 0:00:00.614376
Episode 640 | Total Reward: -484.51 | MaxQ_Action_Count: 199 | Epsilon: 0.0010 | Elapsed: 0:00:00.626510
Episode 650 | Total Reward: -122.01 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.679646
Episode 660 | Total Reward: -124.51 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.576189
Episode 670 | Total Reward: -357.23 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.655225
Episode 680 | Total Reward: -127.42 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.719835
Episode 690 | Total Reward: -126.34 | MaxQ_Action_Count: 199 | Epsilon: 0.0010 | Elapsed: 0:00:00.691893
Episode 700 | Total Reward: -130.80 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.647632
Episode 710 | Total Reward: -235.25 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.664781
Episode 720 | Total Reward: -2.75 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.676082
Episode 730 | Total Reward: -234.89 | MaxQ_Action_Count: 199 | Epsilon: 0.0010 | Elapsed: 0:00:00.658749
Episode 740 | Total Reward: -258.37 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.674140
Episode 750 | Total Reward: -2.35 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.619855
Episode 760 | Total Reward: -2.99 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.648557
Episode 770 | Total Reward: -312.24 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.704718
Episode 780 | Total Reward: -124.80 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.611422
Episode 790 | Total Reward: -123.40 | MaxQ_Action_Count: 199 | Epsilon: 0.0010 | Elapsed: 0:00:00.784165
VISUALIZING THE PERFORMANCE OF THE ENHANCED DQN MODEL
# Calculating statistical measures
average_reward = np.mean(ImprovedDQN_results['total_rewards'])
median_reward = np.median(ImprovedDQN_results['total_rewards'])
max_reward = np.max(ImprovedDQN_results['total_rewards'])
min_reward = np.min(ImprovedDQN_results['total_rewards'])
# Identifying the best episode
best_episode_index = np.argmax(ImprovedDQN_results['total_rewards'])
# Printing the Statistics
print("Performance Statistics for the Improved DQN Model:")
print("--------------------------------------------")
print(f"Best Episode : {best_episode_index}")
print(f"Average Reward : {average_reward:.2f}")
print(f"Median Reward : {median_reward:.2f}")
print(f"Maximum Reward : {max_reward:.2f}")
print(f"Minimum Reward : {min_reward:.2f}")
# Plot the charts to show performance over time
plot_agent_performance(ImprovedDQN_results['total_rewards'], average_reward, model_name="Improved DQN")
Performance Statistics for the Improved DQN Model:
--------------------------------------------
Best Episode : 172
Average Reward : -265.13
Median Reward : -130.56
Maximum Reward : -0.60
Minimum Reward : -1796.05

VIEWING THE MODEL ARCHITECTURE AND PENDULUM ANIMATION
We load the best episode's saved weights and switch the network to inference mode with PyTorch's .eval() function.
# Load and view the model's architecture used for the Improved DQN
trained_model = ImprovedDQNAgent()
trained_model.Q.load_state_dict(torch.load("IMPROVED DQN/IMPROVED_DQN172.pt"))
trained_model.Q.eval()
ImprovedQNetwork(
(fc_1): Linear(in_features=3, out_features=64, bias=True)
(fc_2): Linear(in_features=64, out_features=32, bias=True)
(fc_3): Linear(in_features=32, out_features=16, bias=True)
(fc_out): Linear(in_features=16, out_features=15, bias=True)
)
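A quick sanity check on the printed architecture: counting the weights and biases of each Linear layer gives the network's total trainable parameter count (a small helper of ours, not from the notebook):

```python
def linear_params(n_in, n_out):
    # An nn.Linear(n_in, n_out) layer holds n_in*n_out weights plus n_out biases
    return n_in * n_out + n_out

# fc_1, fc_2, fc_3, fc_out as printed above
layers = [(3, 64), (64, 32), (32, 16), (16, 15)]
total = sum(linear_params(i, o) for i, o in layers)
print(total)  # 3119 trainable parameters
```

The improved network is still tiny, so the extra fc_3 layer adds depth (and a second nonlinearity between 32 and 16 units) at negligible compute cost.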
TESTING OUR MODEL WEIGHTS
# Creating a test agent for the Improved DQN
class ImprovedDQNTestAgent:
    def __init__(self, weight_file_path):
        self.state_dim = 3
        self.action_dim = 15  # Increased discretization of the action space
        self.lr = 0.001
        self.trained_model = weight_file_path
        self.Q = ImprovedQNetwork(self.state_dim, self.action_dim, self.lr)
        self.Q.load_state_dict(torch.load(self.trained_model))

    def choose_action(self, state):
        with torch.no_grad():
            action = float(torch.argmax(self.Q(state)).numpy())
        # Map the discrete index 0-14 onto the torque range [-2, 2]
        # (the original code reused the 9-action mapping (action - 4) / 2 here)
        real_action = (action - 7) / 3.5
        return real_action
agent = ImprovedDQNTestAgent('IMPROVED DQN/IMPROVED_DQN172.pt')
test_agent(agent, 'Improved DQN')
Test reward: -120.75985653386435

MODEL TRAINING EVOLUTION
# Visualizing the pendulum's animation
create_animation(ImprovedDQN_results['frames'])
The Double Deep-Q Network (DDQN) is an advanced reinforcement learning model that builds upon the architecture of the Deep-Q Network (DQN). It addresses a critical shortcoming in the DQN, namely the overestimation of action values due to the same network being used for both selecting and evaluating an action.

WHAT ARE THE ADVANTAGES OF DDQN?
The Double Deep-Q Network (DDQN) offers significant advantages over the traditional Deep-Q Network (DQN) in terms of learning accuracy and stability. By separating action selection and value estimation between two neural networks, DDQN effectively reduces the overestimation bias common in DQNs. This separation ensures more reliable and stable learning outcomes. Additionally, the strategy of using delayed updates for the target network contributes to the overall stability of the learning process. Furthermore, DDQN typically exhibits enhanced performance, especially in environments characterized by noisy or misleading reward signals, demonstrating its superiority in complex learning scenarios.
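The difference between the two targets fits in one line each. In DQN the target network both selects and evaluates the next action; in DDQN the online network selects and the target network evaluates. A toy numpy illustration (the values are made up purely to show the mechanism):

```python
import numpy as np

gamma = 0.98  # same discount factor as the agents above
r = 0.0

q_next_online = np.array([5.0, 0.0])  # online net (over)rates action 0
q_next_target = np.array([1.0, 2.0])  # target net's estimates

# DQN: max over the target net's own values
dqn_target = r + gamma * q_next_target.max()     # 0.98 * 2.0 = 1.96

# DDQN: online net picks the action, target net scores it
a_star = int(np.argmax(q_next_online))           # action 0
ddqn_target = r + gamma * q_next_target[a_star]  # 0.98 * 1.0 = 0.98
```

Because DDQN scores the selected action with an independent network, a single network's overestimation no longer feeds straight into the bootstrap target.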
SETTING UP THE MODEL ARCHITECTURE FOR THE DDQN MODEL
Below are the sections changed to suit DDQN's architecture:
- Greedy actions in the choose_action function are selected using the target network self.Q_target.
- In calc_target, the online network self.Q picks the best next action and the target network self.Q_target evaluates it.
# Defining the QNetwork class for the DDQN Agent
class QNetwork(nn.Module):
    def __init__(self, state_dim, action_dim, q_lr):
        super(QNetwork, self).__init__()
        self.fc_1 = nn.Linear(state_dim, 64)
        self.fc_2 = nn.Linear(64, 32)
        self.fc_3 = nn.Linear(32, 16)
        self.fc_out = nn.Linear(16, action_dim)
        self.lr = q_lr
        self.optimizer = optim.Adam(self.parameters(), lr=self.lr)

    def forward(self, x):
        q = F.leaky_relu(self.fc_1(x))
        q = F.leaky_relu(self.fc_2(q))
        q = F.leaky_relu(self.fc_3(q))
        q = self.fc_out(q)
        return q
# Creating a class for the DDQN Agent
class DDQNAgent:
    def __init__(self):
        self.state_dim = 3
        self.action_dim = 15
        self.lr = 0.001
        self.gamma = 0.98
        self.tau = 0.01
        self.epsilon = 1.5
        self.epsilon_decay = 0.98
        self.epsilon_min = 0.001
        self.buffer_size = 100000
        self.batch_size = 200
        self.memory = ReplayBuffer(self.buffer_size)
        self.Q = QNetwork(self.state_dim, self.action_dim, self.lr)
        self.Q_target = QNetwork(self.state_dim, self.action_dim, self.lr)
        self.Q_target.load_state_dict(self.Q.state_dict())

    def choose_action(self, state):
        random_number = np.random.rand()
        maxQ_action_count = 0
        if self.epsilon < random_number:
            with torch.no_grad():
                # Use Q_target for action selection
                action = float(torch.argmax(self.Q_target(state)).numpy())
            # Map the discrete index 0-14 onto the torque range [-2, 2]
            # (the original code still used the 9-action mapping here)
            real_action = (action - 7) / 3.5
            maxQ_action_count = 1
        else:
            # Sample from all 15 actions, not the original range(9)
            action = np.random.choice(self.action_dim)
            real_action = (action - 7) / 3.5
        return action, real_action, maxQ_action_count

    def calc_target(self, mini_batch):
        s, a, r, s_prime, done = mini_batch
        with torch.no_grad():
            # Use the online network Q to select the best next action...
            best_next_action = torch.argmax(self.Q(s_prime), dim=1, keepdim=True)
            # ...and the target network to evaluate it
            q_target = self.Q_target(s_prime).gather(1, best_next_action)
            target = r + self.gamma * done * q_target
        return target

    def train_agent(self):
        mini_batch = self.memory.sample(self.batch_size)
        s_batch, a_batch, r_batch, s_prime_batch, done_batch = mini_batch
        a_batch = a_batch.type(torch.int64)
        td_target = self.calc_target(mini_batch)
        # QNetwork training
        Q_a = self.Q(s_batch).gather(1, a_batch)
        q_loss = F.smooth_l1_loss(Q_a, td_target)
        self.Q.optimizer.zero_grad()
        q_loss.mean().backward()
        self.Q.optimizer.step()
        # QNetwork Soft Update for DDQN
        for param_target, param in zip(self.Q_target.parameters(), self.Q.parameters()):
            param_target.data.copy_(self.tau * param.data + (1.0 - self.tau) * param_target.data)

def train_DDQNAgent():
    # Initialize the DDQN Agent and related variables required
    agent = DDQNAgent()
    env = gym.make('Pendulum-v1', g=9.81)
    episodes = 800
    total_rewards = []
    success_count = 0
    no_of_steps = []
    frames = []
    best_episode = 0
    best_reward = float('-inf')
    # Loop through the range of episodes
    for episode in range(episodes):
        state = env.reset()
        score, done = 0.0, False
        maxQ_action_count = 0
        start_time = datetime.datetime.now()
        while not done:
            action, real_action, count = agent.choose_action(torch.FloatTensor(state))
            state_prime, reward, done, _ = env.step([real_action])
            agent.memory.put((state, action, reward, state_prime, done))
            score += reward
            maxQ_action_count += count
            state = state_prime
            if maxQ_action_count % 100 == 0 and score > -50:
                screen = env.render(mode='rgb_array')
                frames.append(screen)
            if agent.memory.size() > 1000:
                agent.train_agent()
        # Recording results
        if len(total_rewards) > 0:
            success_count += (score - total_rewards[-1]) >= 200
        total_rewards.append(score)
        no_of_steps.append(maxQ_action_count)
        if score > best_reward:
            best_reward = score
            best_episode = episode
        # Saving the Models
        save_folder = "DDQN"
        if not os.path.exists(save_folder):
            os.makedirs(save_folder)
        if episode == best_episode:
            model_name = os.path.join(save_folder, "DDQN" + str(episode) + ".pt")
            torch.save(agent.Q.state_dict(), model_name)
        if episode % 10 == 0:
            elapsed_time = datetime.datetime.now() - start_time
            print('Episode {:>4} | Total Reward: {:>8.2f} | MaxQ_Action_Count:{:>5} | Epsilon: {:>4.4f} | Elapsed: {}'.format(episode, score, maxQ_action_count, agent.epsilon, elapsed_time))
        if agent.epsilon > agent.epsilon_min:
            agent.epsilon *= agent.epsilon_decay
    env.close()
    return {
        'total_rewards': total_rewards,
        'no_of_steps': no_of_steps,
        'success_count': success_count,
        'frames': frames
    }
DDQN_results = train_DDQNAgent()
Episode 0 | Total Reward: -1080.22 | MaxQ_Action_Count: 0 | Epsilon: 1.5000 | Elapsed: 0:00:00.492155
Episode 10 | Total Reward: -784.74 | MaxQ_Action_Count: 0 | Epsilon: 1.2256 | Elapsed: 0:00:00.857798
Episode 20 | Total Reward: -1393.62 | MaxQ_Action_Count: 0 | Epsilon: 1.0014 | Elapsed: 0:00:00.723278
Episode 30 | Total Reward: -972.29 | MaxQ_Action_Count: 32 | Epsilon: 0.8182 | Elapsed: 0:00:00.691218
Episode 40 | Total Reward: -1041.89 | MaxQ_Action_Count: 64 | Epsilon: 0.6686 | Elapsed: 0:00:00.558482
Episode 50 | Total Reward: -628.97 | MaxQ_Action_Count: 80 | Epsilon: 0.5463 | Elapsed: 0:00:00.693278
Episode 60 | Total Reward: -508.24 | MaxQ_Action_Count: 114 | Epsilon: 0.4463 | Elapsed: 0:00:00.717819
Episode 70 | Total Reward: -243.42 | MaxQ_Action_Count: 133 | Epsilon: 0.3647 | Elapsed: 0:00:00.712669
Episode 80 | Total Reward: -122.18 | MaxQ_Action_Count: 144 | Epsilon: 0.2980 | Elapsed: 0:00:00.772529
Episode 90 | Total Reward: -237.63 | MaxQ_Action_Count: 153 | Epsilon: 0.2435 | Elapsed: 0:00:00.689303
Episode 100 | Total Reward: -269.47 | MaxQ_Action_Count: 157 | Epsilon: 0.1989 | Elapsed: 0:00:00.661259
Episode 110 | Total Reward: -360.77 | MaxQ_Action_Count: 175 | Epsilon: 0.1625 | Elapsed: 0:00:00.783340
Episode 120 | Total Reward: -121.18 | MaxQ_Action_Count: 179 | Epsilon: 0.1328 | Elapsed: 0:00:00.762387
Episode 130 | Total Reward: -125.81 | MaxQ_Action_Count: 181 | Epsilon: 0.1085 | Elapsed: 0:00:00.647560
Episode 140 | Total Reward: -233.61 | MaxQ_Action_Count: 181 | Epsilon: 0.0887 | Elapsed: 0:00:00.727607
Episode 150 | Total Reward: -122.27 | MaxQ_Action_Count: 182 | Epsilon: 0.0724 | Elapsed: 0:00:00.534581
Episode 160 | Total Reward: -122.22 | MaxQ_Action_Count: 189 | Epsilon: 0.0592 | Elapsed: 0:00:00.583074
Episode 170 | Total Reward: -123.95 | MaxQ_Action_Count: 191 | Epsilon: 0.0484 | Elapsed: 0:00:00.576898
Episode 180 | Total Reward: -124.39 | MaxQ_Action_Count: 189 | Epsilon: 0.0395 | Elapsed: 0:00:00.561412
Episode 190 | Total Reward: -382.09 | MaxQ_Action_Count: 194 | Epsilon: 0.0323 | Elapsed: 0:00:00.572251
Episode 200 | Total Reward: -121.65 | MaxQ_Action_Count: 195 | Epsilon: 0.0264 | Elapsed: 0:00:00.587561
Episode 210 | Total Reward: -120.60 | MaxQ_Action_Count: 198 | Epsilon: 0.0216 | Elapsed: 0:00:00.575041
Episode 220 | Total Reward: -114.98 | MaxQ_Action_Count: 199 | Epsilon: 0.0176 | Elapsed: 0:00:00.683037
Episode 230 | Total Reward: -122.13 | MaxQ_Action_Count: 200 | Epsilon: 0.0144 | Elapsed: 0:00:00.763769
Episode 240 | Total Reward: -246.70 | MaxQ_Action_Count: 199 | Epsilon: 0.0118 | Elapsed: 0:00:00.580312
Episode 250 | Total Reward: -231.83 | MaxQ_Action_Count: 199 | Epsilon: 0.0096 | Elapsed: 0:00:01.545827
Episode 260 | Total Reward: -234.21 | MaxQ_Action_Count: 199 | Epsilon: 0.0079 | Elapsed: 0:00:00.588389
Episode 270 | Total Reward: -378.82 | MaxQ_Action_Count: 198 | Epsilon: 0.0064 | Elapsed: 0:00:00.604841
Episode 280 | Total Reward: -119.93 | MaxQ_Action_Count: 200 | Epsilon: 0.0052 | Elapsed: 0:00:00.645680
Episode 290 | Total Reward: -125.49 | MaxQ_Action_Count: 199 | Epsilon: 0.0043 | Elapsed: 0:00:00.610931
Episode 300 | Total Reward: -243.49 | MaxQ_Action_Count: 199 | Epsilon: 0.0035 | Elapsed: 0:00:00.599383
Episode 310 | Total Reward: -242.24 | MaxQ_Action_Count: 200 | Epsilon: 0.0029 | Elapsed: 0:00:00.677754
Episode 320 | Total Reward: -355.14 | MaxQ_Action_Count: 198 | Epsilon: 0.0023 | Elapsed: 0:00:00.603506
Episode 330 | Total Reward: -120.02 | MaxQ_Action_Count: 200 | Epsilon: 0.0019 | Elapsed: 0:00:00.639646
Episode 340 | Total Reward: -123.97 | MaxQ_Action_Count: 200 | Epsilon: 0.0016 | Elapsed: 0:00:00.577731
Episode 350 | Total Reward: -1.11 | MaxQ_Action_Count: 199 | Epsilon: 0.0013 | Elapsed: 0:00:00.622338
Episode 360 | Total Reward: -245.18 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.641010
Episode 370 | Total Reward: -233.93 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.608609
Episode 380 | Total Reward: -125.98 | MaxQ_Action_Count: 199 | Epsilon: 0.0010 | Elapsed: 0:00:00.616386
Episode 390 | Total Reward: -248.56 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.637242
Episode 400 | Total Reward: -2.12 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.649135
Episode 410 | Total Reward: -244.65 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.613399
Episode 420 | Total Reward: -127.14 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.620103
Episode 430 | Total Reward: -1.47 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.671880
Episode 440 | Total Reward: -115.95 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.691710
Episode 450 | Total Reward: -124.82 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.657961
Episode 460 | Total Reward: -126.57 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.621857
Episode 470 | Total Reward: -252.93 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.626950
Episode 480 | Total Reward: -1.16 | MaxQ_Action_Count: 198 | Epsilon: 0.0010 | Elapsed: 0:00:00.706426
Episode 490 | Total Reward: -117.87 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.881479
Episode 500 | Total Reward: -246.50 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.857655
Episode 510 | Total Reward: -128.50 | MaxQ_Action_Count: 199 | Epsilon: 0.0010 | Elapsed: 0:00:00.881455
Episode 520 | Total Reward: -127.42 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.940179
Episode 530 | Total Reward: -123.36 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.805455
Episode 540 | Total Reward: -124.35 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.808052
Episode 550 | Total Reward: -119.67 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.763212
Episode 560 | Total Reward: -128.98 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.776317
Episode 570 | Total Reward: -248.52 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.783325
Episode 580 | Total Reward: -123.06 | MaxQ_Action_Count: 199 | Epsilon: 0.0010 | Elapsed: 0:00:00.851112
Episode 590 | Total Reward: -120.66 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.885875
Episode 600 | Total Reward: -235.40 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.808629
Episode 610 | Total Reward: -1.60 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.828082
Episode 620 | Total Reward: -2.54 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.772766
Episode 630 | Total Reward: -238.89 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.802408
Episode 640 | Total Reward: -121.04 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.853356
Episode 650 | Total Reward: -126.19 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.837425
Episode 660 | Total Reward: -124.32 | MaxQ_Action_Count: 199 | Epsilon: 0.0010 | Elapsed: 0:00:00.766248
Episode 670 | Total Reward: -124.67 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.783132
Episode 680 | Total Reward: -0.32 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.706073
Episode 690 | Total Reward: -360.14 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.624401
Episode 700 | Total Reward: -124.49 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.622957
Episode 710 | Total Reward: -0.72 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.673523
Episode 720 | Total Reward: -362.04 | MaxQ_Action_Count: 199 | Epsilon: 0.0010 | Elapsed: 0:00:00.629440
Episode 730 | Total Reward: -383.93 | MaxQ_Action_Count: 199 | Epsilon: 0.0010 | Elapsed: 0:00:00.625617
Episode 740 | Total Reward: -231.58 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.659562
Episode 750 | Total Reward: -364.26 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.627990
Episode 760 | Total Reward: -124.43 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.653057
Episode 770 | Total Reward: -118.20 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.639233
Episode 780 | Total Reward: -352.28 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.665880
Episode 790 | Total Reward: -1.09 | MaxQ_Action_Count: 200 | Epsilon: 0.0010 | Elapsed: 0:00:00.697581
VISUALIZING THE PERFORMANCE OF THE DOUBLE DQN MODEL
# Calculating statistical measures
average_reward = np.mean(DDQN_results['total_rewards'])
median_reward = np.median(DDQN_results['total_rewards'])
max_reward = np.max(DDQN_results['total_rewards'])
min_reward = np.min(DDQN_results['total_rewards'])
# Identifying the best episode
best_episode_index = np.argmax(DDQN_results['total_rewards'])
# Printing the Statistics
print("Performance Statistics for the Double DQN Model:")
print("--------------------------------------------")
print(f"Best Episode : {best_episode_index}")
print(f"Average Reward : {average_reward:.2f}")
print(f"Median Reward : {median_reward:.2f}")
print(f"Maximum Reward : {max_reward:.2f}")
print(f"Minimum Reward : {min_reward:.2f}")
# Plot the charts to show performance over time
plot_agent_performance(DDQN_results['total_rewards'], average_reward, model_name="Double DQN")
Performance Statistics for the Double DQN Model:
--------------------------------------------
Best Episode : 776
Average Reward : -261.91
Median Reward : -129.09
Maximum Reward : -0.26
Minimum Reward : -1756.07

VIEWING THE MODEL ARCHITECTURE AND PENDULUM ANIMATION
We can inspect the trained network by loading the best model's weights and switching to inference mode with the .eval() function for PyTorch.
# Load and view the model's architecture used for DDQN
trained_model = DDQNAgent()
trained_model.Q.load_state_dict(torch.load("DDQN/DDQN776.pt"))
trained_model.Q.eval()
QNetwork(
(fc_1): Linear(in_features=3, out_features=64, bias=True)
(fc_2): Linear(in_features=64, out_features=32, bias=True)
(fc_3): Linear(in_features=32, out_features=16, bias=True)
(fc_out): Linear(in_features=16, out_features=15, bias=True)
)
TESTING OUR MODEL WEIGHTS
class DDQNTestAgent:
def __init__(self, weight_file_path):
self.state_dim = 3
self.action_dim = 15
self.lr = 0.001
self.trained_model = weight_file_path
self.Q = QNetwork(self.state_dim, self.action_dim, self.lr)
self.Q.load_state_dict(torch.load(self.trained_model))
def choose_action(self, state):
with torch.no_grad():
action = float(torch.argmax(self.Q(state)).numpy())
real_action = (action - 4) / 2
return real_action
agent = DDQNTestAgent("DDQN/DDQN776.pt")
test_agent(agent, 'DDQN')
Test reward: -367.44327135316513

MODEL TRAINING EVOLUTION
# Visualizing the pendulum's animation
create_animation(DDQN_results['frames'])
The Soft Actor-Critic (SAC) network is an agent that employs a stochastic policy for action selection, enabling it to capture the inherent uncertainty in many real-world environments. This stochasticity helps SAC explore better and handle environments with continuous action spaces, which suits the Pendulum task.
SAC introduces an entropy term into the objective function. This term encourages the policy to take actions that are not only rewarding but also diverse; it prevents premature convergence to suboptimal policies and aids exploration. At the same time, SAC uses a soft value function, allowing it to handle both continuous and discrete action spaces seamlessly.
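The entropy-augmented objective SAC maximizes can be written in its standard form as follows, where $\alpha$ is the temperature coefficient (tuned automatically via `log_alpha` in our agent) and $\mathcal{H}$ denotes policy entropy:

$$J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right]$$

Setting $\alpha = 0$ recovers the usual expected-return objective; a larger $\alpha$ rewards more random (higher-entropy) behaviour, trading reward for exploration.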
WHAT ARE THE ADVANTAGES OF SAC?
Stochastic Policies: SAC's use of stochastic policies allows for better exploration, especially in environments with continuous action spaces, where deterministic policies may struggle.
Entropy Regularization: The inclusion of an entropy regularization term encourages diverse actions and robust exploration, preventing the algorithm from getting stuck in suboptimal solutions.
Sample Efficiency: Being an off-policy algorithm, SAC can make more efficient use of past experiences, reducing the need for extensive interaction with the environment.
Versatility: SAC can handle both continuous and discrete action spaces, making it suitable for a wide range of reinforcement learning tasks.
Actor-Critic Separation: Separating the actor and critic networks reduces overestimation bias and contributes to more stable learning.
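As a minimal sketch of the last point, the twin-critic trick pairs two independent Q-estimates and builds targets from their element-wise minimum, which counteracts overestimation bias (the numbers below are made up for illustration, not values from our runs):

```python
import numpy as np

# Two hypothetical Q-value estimates for the same (state, action) pairs
q1 = np.array([1.0, 3.0, -0.5])  # critic 1
q2 = np.array([2.0, 2.5, -0.2])  # critic 2

# The conservative estimate used when forming the TD target
q_min = np.minimum(q1, q2)
print(q_min)  # [ 1.   2.5 -0.5]
```

This is the same idea as the `torch.min(q1_target, q2_target)` step inside the SAC agent's `calc_target` method below.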
SETTING UP THE MODEL ARCHITECTURE FOR THE SAC MODEL
# Defining the PolicyNetwork class for the SAC Agent
class PolicyNetwork(nn.Module):
def __init__(self, state_dim, action_dim, actor_lr):
super(PolicyNetwork, self).__init__()
self.fc_1 = nn.Linear(state_dim, 64)
self.fc_2 = nn.Linear(64, 64)
self.fc_mu = nn.Linear(64, action_dim)
self.fc_std = nn.Linear(64, action_dim)
self.lr = actor_lr
self.LOG_STD_MIN = -20
self.LOG_STD_MAX = 2
self.max_action = 2
self.min_action = -2
self.action_scale = (self.max_action - self.min_action) / 2.0
self.action_bias = (self.max_action + self.min_action) / 2.0
self.optimizer = optim.Adam(self.parameters(), lr=self.lr)
def forward(self, x):
x = F.leaky_relu(self.fc_1(x))
x = F.leaky_relu(self.fc_2(x))
mu = self.fc_mu(x)
log_std = self.fc_std(x)
log_std = torch.clamp(log_std, self.LOG_STD_MIN, self.LOG_STD_MAX)
return mu, log_std
def sample(self, state):
mean, log_std = self.forward(state)
std = torch.exp(log_std)
reparameter = Normal(mean, std)
x_t = reparameter.rsample()
y_t = torch.tanh(x_t)
action = self.action_scale * y_t + self.action_bias
# Enforcing action bounds (tanh change-of-variables correction to the log-probability)
log_prob = reparameter.log_prob(x_t)
log_prob = log_prob - torch.sum(torch.log(self.action_scale * (1 - y_t.pow(2)) + 1e-6), dim=-1, keepdim=True)
return action, log_prob
# Defining the QNetwork class for the SAC Agent
class QNetwork(nn.Module):
def __init__(self, state_dim, action_dim, critic_lr):
super(QNetwork, self).__init__()
self.fc_s = nn.Linear(state_dim, 32)
self.fc_a = nn.Linear(action_dim, 32)
self.fc_1 = nn.Linear(64, 64)
self.fc_out = nn.Linear(64, action_dim)
self.lr = critic_lr
self.optimizer = optim.Adam(self.parameters(), lr=self.lr)
def forward(self, x, a):
h1 = F.leaky_relu(self.fc_s(x))
h2 = F.leaky_relu(self.fc_a(a))
cat = torch.cat([h1, h2], dim=-1)
q = F.leaky_relu(self.fc_1(cat))
q = self.fc_out(q)
return q
# Creating and defining the SAC Agent
class SACAgent:
def __init__(self):
self.state_dim = 3
self.action_dim = 1
self.lr_pi = 0.001
self.lr_q = 0.001
self.gamma = 0.98
self.batch_size = 200
self.buffer_limit = 100000
self.tau = 0.005
self.init_alpha = 0.01
self.target_entropy = -self.action_dim
self.lr_alpha = 0.005
self.memory = ReplayBuffer(self.buffer_limit)
self.log_alpha = torch.tensor(np.log(self.init_alpha))
self.log_alpha.requires_grad = True
self.log_alpha_optimizer = optim.Adam([self.log_alpha], lr=self.lr_alpha)
self.PI = PolicyNetwork(self.state_dim, self.action_dim, self.lr_pi)
self.Q1 = QNetwork(self.state_dim, self.action_dim, self.lr_q)
self.Q1_target = QNetwork(self.state_dim, self.action_dim, self.lr_q)
self.Q2 = QNetwork(self.state_dim, self.action_dim, self.lr_q)
self.Q2_target = QNetwork(self.state_dim, self.action_dim, self.lr_q)
self.Q1_target.load_state_dict(self.Q1.state_dict())
self.Q2_target.load_state_dict(self.Q2.state_dict())
def choose_action(self, s):
with torch.no_grad():
action, log_prob = self.PI.sample(s)
return action, log_prob
def calc_target(self, mini_batch):
s, a, r, s_prime, done = mini_batch
with torch.no_grad():
a_prime, log_prob_prime = self.PI.sample(s_prime)
entropy = - self.log_alpha.exp() * log_prob_prime
q1_target, q2_target = self.Q1_target(s_prime, a_prime), self.Q2_target(s_prime, a_prime)
q_target = torch.min(q1_target, q2_target)
target = r + self.gamma * done * (q_target + entropy)
return target
def train_agent(self):
mini_batch = self.memory.sample(self.batch_size)
s_batch, a_batch, r_batch, s_prime_batch, done_batch = mini_batch
td_target = self.calc_target(mini_batch)
# Training of Q1
q1_loss = F.smooth_l1_loss(self.Q1(s_batch, a_batch), td_target)
self.Q1.optimizer.zero_grad()
q1_loss.mean().backward()
self.Q1.optimizer.step()
# Training of Q2
q2_loss = F.smooth_l1_loss(self.Q2(s_batch, a_batch), td_target)
self.Q2.optimizer.zero_grad()
q2_loss.mean().backward()
self.Q2.optimizer.step()
# Training of PI
a, log_prob = self.PI.sample(s_batch)
entropy = -self.log_alpha.exp() * log_prob
q1, q2 = self.Q1(s_batch, a), self.Q2(s_batch, a)
q = torch.min(q1, q2)
pi_loss = -(q + entropy) # For gradient ascent
self.PI.optimizer.zero_grad()
pi_loss.mean().backward()
self.PI.optimizer.step()
# Alpha train
self.log_alpha_optimizer.zero_grad()
alpha_loss = -(self.log_alpha.exp() * (log_prob + self.target_entropy).detach()).mean()
alpha_loss.backward()
self.log_alpha_optimizer.step()
# Soft update of Q1 and Q2
for param_target, param in zip(self.Q1_target.parameters(), self.Q1.parameters()):
param_target.data.copy_(param_target.data * (1.0 - self.tau) + param.data * self.tau)
for param_target, param in zip(self.Q2_target.parameters(), self.Q2.parameters()):
param_target.data.copy_(param_target.data * (1.0 - self.tau) + param.data * self.tau)
def train_SACAgent():
# Initialize the SAC Agent and related variables required
agent = SACAgent()
env = gym.make('Pendulum-v1', g=9.81)
episodes = 800
total_rewards = []
no_of_steps = []
success_count = 0
frames = []
best_episode = 0
best_reward = float('-inf')
# Loop through the range of episodes
for episode in range(episodes):
state = env.reset()
score, done = 0.0, False
start_time = datetime.datetime.now()
counter = 0
while not done:
counter += 1
action, log_prob = agent.choose_action(torch.FloatTensor(state))
state_prime, reward, done, _ = env.step([action])
agent.memory.put((state, action, reward, state_prime, done))
score += reward
state = state_prime
if counter % 50 == 0 and score > -50:
screen = env.render(mode='rgb_array')
frames.append(screen)
if agent.memory.size() > 1000:
agent.train_agent()
# Recording results
if len(total_rewards) > 0:
success_count += (score - total_rewards[-1]) >= 200
total_rewards.append(score)
no_of_steps.append(counter)
if score > best_reward:
best_reward = score
best_episode = episode
# Saving the Models
save_folder = "SAC"
if not os.path.exists(save_folder):
os.makedirs(save_folder)
if episode == best_episode:
model_name = os.path.join(save_folder, "SAC" + str(episode) + ".pt")
torch.save(agent.PI.state_dict(), model_name)
if episode % 10 == 0:
elapsed_time = datetime.datetime.now() - start_time
print('Episode {:>4} | Total Reward: {:>8.2f} | Elapsed: {}'.format(episode, score, elapsed_time))
env.close()
return {
'total_rewards': total_rewards,
'no_of_steps': no_of_steps,
'success_count': success_count,
'frames': frames
}
SAC_results = train_SACAgent()
Episode 0 | Total Reward: -1263.81 | Elapsed: 0:00:00.091711
Episode 10 | Total Reward: -1489.46 | Elapsed: 0:00:02.265242
Episode 20 | Total Reward: -389.96 | Elapsed: 0:00:02.632590
Episode 30 | Total Reward: -127.36 | Elapsed: 0:00:02.051764
Episode 40 | Total Reward: -131.92 | Elapsed: 0:00:02.080935
Episode 50 | Total Reward: -368.68 | Elapsed: 0:00:01.994790
Episode 60 | Total Reward: -125.87 | Elapsed: 0:00:02.099870
Episode 70 | Total Reward: -495.16 | Elapsed: 0:00:02.093062
Episode 80 | Total Reward: -3.40 | Elapsed: 0:00:02.169272
Episode 90 | Total Reward: -0.80 | Elapsed: 0:00:02.194396
Episode 100 | Total Reward: -373.32 | Elapsed: 0:00:02.114902
Episode 110 | Total Reward: -115.08 | Elapsed: 0:00:02.178001
Episode 120 | Total Reward: -126.53 | Elapsed: 0:00:02.228045
Episode 130 | Total Reward: -252.04 | Elapsed: 0:00:02.115674
Episode 140 | Total Reward: -248.56 | Elapsed: 0:00:02.073931
Episode 150 | Total Reward: -130.85 | Elapsed: 0:00:02.063166
Episode 160 | Total Reward: -3.86 | Elapsed: 0:00:02.060495
Episode 170 | Total Reward: -5.87 | Elapsed: 0:00:01.966251
Episode 180 | Total Reward: -378.04 | Elapsed: 0:00:02.164835
Episode 190 | Total Reward: -133.99 | Elapsed: 0:00:02.046350
Episode 200 | Total Reward: -322.09 | Elapsed: 0:00:02.146470
Episode 210 | Total Reward: -131.22 | Elapsed: 0:00:02.798415
Episode 220 | Total Reward: -130.28 | Elapsed: 0:00:02.116329
Episode 230 | Total Reward: -122.91 | Elapsed: 0:00:02.080042
Episode 240 | Total Reward: -126.82 | Elapsed: 0:00:02.457065
Episode 250 | Total Reward: -241.15 | Elapsed: 0:00:02.072398
Episode 260 | Total Reward: -135.84 | Elapsed: 0:00:02.127494
Episode 270 | Total Reward: -128.68 | Elapsed: 0:00:02.092727
Episode 280 | Total Reward: -132.87 | Elapsed: 0:00:01.963818
Episode 290 | Total Reward: -253.82 | Elapsed: 0:00:01.983134
Episode 300 | Total Reward: -5.20 | Elapsed: 0:00:02.007278
Episode 310 | Total Reward: -244.54 | Elapsed: 0:00:02.125452
Episode 320 | Total Reward: -133.17 | Elapsed: 0:00:02.101796
Episode 330 | Total Reward: -252.38 | Elapsed: 0:00:02.108584
Episode 340 | Total Reward: -251.01 | Elapsed: 0:00:02.068382
Episode 350 | Total Reward: -241.99 | Elapsed: 0:00:02.210062
Episode 360 | Total Reward: -246.01 | Elapsed: 0:00:02.027896
Episode 370 | Total Reward: -253.02 | Elapsed: 0:00:01.965153
Episode 380 | Total Reward: -130.73 | Elapsed: 0:00:01.942947
Episode 390 | Total Reward: -131.26 | Elapsed: 0:00:02.073364
Episode 400 | Total Reward: -246.84 | Elapsed: 0:00:02.130810
Episode 410 | Total Reward: -345.02 | Elapsed: 0:00:02.230273
Episode 420 | Total Reward: -0.68 | Elapsed: 0:00:02.209572
Episode 430 | Total Reward: -228.04 | Elapsed: 0:00:02.294913
Episode 440 | Total Reward: -131.99 | Elapsed: 0:00:02.166283
Episode 450 | Total Reward: -130.42 | Elapsed: 0:00:02.070453
Episode 460 | Total Reward: -246.70 | Elapsed: 0:00:02.073242
Episode 470 | Total Reward: -233.23 | Elapsed: 0:00:02.109055
Episode 480 | Total Reward: -130.46 | Elapsed: 0:00:02.173595
Episode 490 | Total Reward: -122.46 | Elapsed: 0:00:02.161290
Episode 500 | Total Reward: -121.94 | Elapsed: 0:00:02.142550
Episode 510 | Total Reward: -231.43 | Elapsed: 0:00:02.151427
Episode 520 | Total Reward: -3.00 | Elapsed: 0:00:02.182354
Episode 530 | Total Reward: -132.74 | Elapsed: 0:00:02.005636
Episode 540 | Total Reward: -2.03 | Elapsed: 0:00:02.191058
Episode 550 | Total Reward: -2.99 | Elapsed: 0:00:02.204246
Episode 560 | Total Reward: -1.44 | Elapsed: 0:00:02.168411
Episode 570 | Total Reward: -132.42 | Elapsed: 0:00:02.168530
Episode 580 | Total Reward: -220.00 | Elapsed: 0:00:02.069063
Episode 590 | Total Reward: -126.75 | Elapsed: 0:00:02.156660
Episode 600 | Total Reward: -239.90 | Elapsed: 0:00:02.087662
Episode 610 | Total Reward: -134.35 | Elapsed: 0:00:02.060851
Episode 620 | Total Reward: -131.85 | Elapsed: 0:00:02.062815
Episode 630 | Total Reward: -5.70 | Elapsed: 0:00:02.174400
Episode 640 | Total Reward: -125.27 | Elapsed: 0:00:02.184379
Episode 650 | Total Reward: -242.39 | Elapsed: 0:00:02.187821
Episode 660 | Total Reward: -241.77 | Elapsed: 0:00:02.161412
Episode 670 | Total Reward: -128.78 | Elapsed: 0:00:02.088581
Episode 680 | Total Reward: -5.30 | Elapsed: 0:00:02.113722
Episode 690 | Total Reward: -132.38 | Elapsed: 0:00:02.080776
Episode 700 | Total Reward: -122.94 | Elapsed: 0:00:02.136434
Episode 710 | Total Reward: -129.84 | Elapsed: 0:00:02.132908
Episode 720 | Total Reward: -6.86 | Elapsed: 0:00:02.192942
Episode 730 | Total Reward: -126.04 | Elapsed: 0:00:02.191723
Episode 740 | Total Reward: -118.76 | Elapsed: 0:00:02.141104
Episode 750 | Total Reward: -246.72 | Elapsed: 0:00:02.325031
Episode 760 | Total Reward: -127.43 | Elapsed: 0:00:02.198647
Episode 770 | Total Reward: -121.37 | Elapsed: 0:00:02.181811
Episode 780 | Total Reward: -2.57 | Elapsed: 0:00:02.231232
Episode 790 | Total Reward: -243.67 | Elapsed: 0:00:02.121809
VISUALIZING THE PERFORMANCE FOR THE SOFT ACTOR-CRITIC MODEL
# Calculating statistical measures
average_reward = np.mean(SAC_results['total_rewards'])
median_reward = np.median(SAC_results['total_rewards'])
max_reward = np.max(SAC_results['total_rewards'])
min_reward = np.min(SAC_results['total_rewards'])
# Identifying the best episode
best_episode_index = np.argmax(SAC_results['total_rewards'])
# Printing the Statistics
print("Performance Statistics for the SAC Model:")
print("--------------------------------------------")
print(f"Best Episode : {best_episode_index}")
print(f"Average Reward : {average_reward:.2f}")
print(f"Median Reward : {median_reward:.2f}")
print(f"Maximum Reward : {max_reward:.2f}")
print(f"Minimum Reward : {min_reward:.2f}")
# Plot the charts to show performance over time
plot_agent_performance(SAC_results['total_rewards'], average_reward, model_name="SAC DQN")
Performance Statistics for the SAC Model:
--------------------------------------------
Best Episode : 42
Average Reward : -188.68
Median Reward : -131.26
Maximum Reward : -0.14
Minimum Reward : -1850.31

VIEWING THE MODEL ARCHITECTURE AND PENDULUM ANIMATION
We can inspect the trained policy network by loading the best model's weights and switching to inference mode with the .eval() function for PyTorch.
# Load and view the model's architecture used for SAC
trained_model = SACAgent()
trained_model.PI.load_state_dict(torch.load("SAC/SAC42.pt"))
trained_model.PI.eval()
PolicyNetwork(
(fc_1): Linear(in_features=3, out_features=64, bias=True)
(fc_2): Linear(in_features=64, out_features=64, bias=True)
(fc_mu): Linear(in_features=64, out_features=1, bias=True)
(fc_std): Linear(in_features=64, out_features=1, bias=True)
)
TESTING OUR MODEL WEIGHTS
Note that there is no need to create a separate test agent: unlike the DQN variants, SAC's choose_action does not rely on epsilon-greedy random numbers to encourage exploration, since exploration comes from the stochastic policy itself.
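If a fully deterministic evaluation were ever desired, one option (not used in this notebook) is to act on the tanh-squashed mean of the Gaussian head instead of a sample. A minimal sketch, with `TinyPolicy` as a hypothetical stand-in exposing the same `(mu, log_std)` interface as our PolicyNetwork:

```python
import torch
import torch.nn as nn

class TinyPolicy(nn.Module):
    """Hypothetical stand-in with the same (mu, log_std) interface."""
    def __init__(self):
        super().__init__()
        self.fc_mu = nn.Linear(3, 1)
        self.action_scale, self.action_bias = 2.0, 0.0

    def forward(self, x):
        return self.fc_mu(x), torch.zeros(1)  # (mu, log_std)

def deterministic_action(policy, state):
    # Squash the Gaussian mean instead of sampling from the distribution
    with torch.no_grad():
        mu, _ = policy(state)
        return policy.action_scale * torch.tanh(mu) + policy.action_bias

a = deterministic_action(TinyPolicy(), torch.zeros(3))
print(float(a))  # always within the [-2, 2] torque bounds
```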
test_agent(trained_model, 'SAC')
Test reward: -127.53441782760822

MODEL TRAINING EVOLUTION
# Visualizing the pendulum's animation
create_animation(SAC_results['frames'])
In this section, we will be performing an evaluation with 800 testing episodes for each model. For performance analysis and evaluation, we will be doing the following:
PERFORMING CALCULATIONS
In this section, we will first perform the calculations necessary to evaluate the performance of each model. The following steps are carried out:
class MetricsCalculator:
def __init__(self, total_rewards, no_of_steps, success_count, n_episodes, frames):
self.total_rewards = total_rewards
self.no_of_steps = no_of_steps
self.success_count = success_count
self.n_episodes = n_episodes
self.frames = frames
def avg_reward_per_episode(self):
sum_reward = np.sum(self.total_rewards)
return sum_reward / self.n_episodes
def std_reward_per_episode(self):
return np.std(self.total_rewards)
def avg_steps_taken(self):
step_count = np.sum(self.no_of_steps)
return step_count / self.n_episodes
def std_steps_taken(self):
return np.std(self.no_of_steps)
def avg_reward_per_step(self):
sum_reward = np.sum(self.total_rewards)
step_count = np.sum(self.no_of_steps)
return sum_reward / step_count
def success_rate(self):
return self.success_count / self.n_episodes
def render_frames(self):
create_animation(self.frames)
DQN_metrics = MetricsCalculator(**DQN_results, n_episodes=800)
ImprovedDQN_metrics = MetricsCalculator(**ImprovedDQN_results, n_episodes=800)
DDQN_metrics = MetricsCalculator(**DDQN_results, n_episodes=800)
SAC_metrics = MetricsCalculator(**SAC_results, n_episodes=800)
def create_dataframe_from_dict(data_dict, column_name=None):
df = pd.DataFrame.from_dict(data_dict, orient='index')
if column_name:
df.columns = [column_name]
return df
PLOTTING THE REWARD BAR PLOT
# Dictionary of average reward per episode for each model
all_avg_reward_per_episode = {
'DQN': DQN_metrics.avg_reward_per_episode(),
'Improved DQN': ImprovedDQN_metrics.avg_reward_per_episode(),
'DDQN': DDQN_metrics.avg_reward_per_episode(),
'SAC': SAC_metrics.avg_reward_per_episode()
}
# Convert the dictionary to a DataFrame
df = create_dataframe_from_dict(all_avg_reward_per_episode, 'Avg_Reward_Per_Episode')
df
| Model | Avg_Reward_Per_Episode |
|---|---|
| DQN | -340.944229 |
| Improved DQN | -545.086096 |
| DDQN | -569.351038 |
| SAC | -176.178789 |
SAC had the highest average reward per episode, indicating its impressive ability to consistently achieve high rewards. Surprisingly, the baseline DQN performed better on this metric than the Improved/Enhanced DQN.
# Sort the DataFrame by 'Avg_Reward_Per_Episode' in ascending order
df = df.sort_values(by='Avg_Reward_Per_Episode', ascending=True)
fig = plt.figure(figsize=(7, 4))
fig.suptitle(f"Average Reward")
ax = fig.subplots()
sns.barplot(
data=df,
y='Avg_Reward_Per_Episode',
x=df.index, # Swap x and y axes
ax=ax,
palette=sns.color_palette('Set2')
)
# ax.legend()
ax.set_ylabel('Avg Reward Per Episode') # Swap x and y axis labels
ax.set_xlabel('Model') # Swap x and y axis labels
plt.show()
PLOTTING THE SUCCESS RATE OF THE MODELS
# Dictionary of success rates for each model
all_success_rate = {
'DQN': DQN_metrics.success_rate(),
'Improved DQN': ImprovedDQN_metrics.success_rate(),
'DDQN': DDQN_metrics.success_rate(),
'SAC': SAC_metrics.success_rate()
}
# Convert the dictionary to a DataFrame
df = create_dataframe_from_dict(all_success_rate, 'success_rate')
df
| Model | success_rate |
|---|---|
| DQN | 0.2450 |
| Improved DQN | 0.1525 |
| DDQN | 0.1575 |
| SAC | 0.0725 |
Success rate here means "how often a model improves on its previous episode's result" (by at least 200 reward).
DQN had the highest success rate because its training was quite irregular: whenever it performed badly, it was able to correct itself quickly in the next episode. This reflects its inability to adapt to the changing environment; it reapplies the same policy to the continuous task, fails for that particular episode, but learns from the failure and improves the very next episode.
SAC scored lowest in this evaluation because it achieved success and stable high rewards very early on, which left it with fewer opportunities to "bounce back" from unfavourable episodes.
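To make the metric concrete, here is a small sketch of the success-count rule used during training: an episode counts as a success when its score beats the previous episode's by at least 200 (the reward trace below is made up for illustration):

```python
# Hypothetical reward trace over five episodes
rewards = [-1200.0, -950.0, -980.0, -700.0, -710.0]

# Count episodes that improve on the previous one by >= 200 reward
success_count = sum(
    (curr - prev) >= 200 for prev, curr in zip(rewards, rewards[1:])
)
success_rate = success_count / len(rewards)
print(success_count, success_rate)  # 2 0.4
```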
# Sort the DataFrame by 'success_rate' in ascending order
df = df.sort_values(by='success_rate', ascending=True)
fig = plt.figure(figsize=(7, 4))
fig.suptitle(f"Success Rate")
ax = fig.subplots()
sns.barplot(
data=df,
y='success_rate',
x=df.index, # Swap x and y axes
ax=ax,
palette=sns.color_palette('Set2')
)
ax.set_ylabel('Success rate')
ax.set_xlabel('Model')
plt.show()
TWO SAMPLE INDEPENDENT T-TEST
Next, we perform a two-sample independent t-test between each pair of models to determine whether the differences in their average rewards are statistically significant. Even if one model's mean is higher than another's, sufficiently large standard deviations can mean the apparent difference is just due to randomness.
Likewise, two models whose means appear very similar may in fact be significantly different if their standard deviations are small. Although the large number of episodes reduces these effects, it is much better to back the comparison with a formal statistical test.
Null Hypothesis (H0): Average results from the different models are identical.
Alternate Hypothesis (H1): Average results from the different models are not identical.
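To illustrate why the spread matters, the toy numbers below (not our real metrics) give two models the same 20-point gap in mean reward; the gap is significant when the standard deviations are small but not when they are large:

```python
from scipy import stats

n = 800  # episodes per model, as in our evaluation
# Same difference in means (-340 vs -360), different spreads
_, p_tight = stats.ttest_ind_from_stats(-340.0, 50.0, n, -360.0, 50.0, n)
_, p_wide = stats.ttest_ind_from_stats(-340.0, 500.0, n, -360.0, 500.0, n)
print(p_tight < 0.05, p_wide < 0.05)  # True False
```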
At a 95% confidence level, the test results show that all model pairs but one differ significantly (H0 is rejected), as indicated by the very small p-values. The exception is Improved DQN vs. DDQN, which had a p-value of 0.09 (> 0.05), so for that pair H0 cannot be rejected.
import numpy as np
from scipy import stats
def two_sample_t_test(mean1, std1, mean2, std2, n1, n2):
t, p = stats.ttest_ind_from_stats(mean1, std1, n1, mean2, std2, n2)
return p
# Define the model metrics
dqn_avg = DQN_metrics.avg_reward_per_episode()
dqn_std = DQN_metrics.std_reward_per_episode()
improved_dqn_avg = ImprovedDQN_metrics.avg_reward_per_episode()
improved_dqn_std = ImprovedDQN_metrics.std_reward_per_episode()
ddqn_avg = DDQN_metrics.avg_reward_per_episode()
ddqn_std = DDQN_metrics.std_reward_per_episode()
sac_avg = SAC_metrics.avg_reward_per_episode()
sac_std = SAC_metrics.std_reward_per_episode()
# Sample sizes
n1 = 800
n2 = 800
# Perform two-sample t-tests and print the results.
# Pair each model's name with its (mean, std) so we can iterate without eval()
model_stats = {
    "DQN": (dqn_avg, dqn_std),
    "Improved_DQN": (improved_dqn_avg, improved_dqn_std),
    "DDQN": (ddqn_avg, ddqn_std),
    "SAC": (sac_avg, sac_std),
}
models = list(model_stats)
for i in range(len(models)):
    for j in range(i + 1, len(models)):
        model1, model2 = models[i], models[j]
        mean1, std1 = model_stats[model1]
        mean2, std2 = model_stats[model2]
        p_value = two_sample_t_test(mean1, std1, mean2, std2, n1, n2)
        print(f"Two-sample t-test between {model1} and {model2}: p-value = {p_value}")
Two-sample t-test between DQN and Improved_DQN: p-value = 1.8851302977853567e-45
Two-sample t-test between DQN and DDQN: p-value = 4.460181319774324e-55
Two-sample t-test between DQN and SAC: p-value = 1.4132384619600633e-40
Two-sample t-test between Improved_DQN and DDQN: p-value = 0.09470622155215883
Two-sample t-test between Improved_DQN and SAC: p-value = 2.3701112197964183e-152
Two-sample t-test between DDQN and SAC: p-value = 1.0529611551567295e-166
ANALYSIS OF MODEL EFFICIENCY
Now, we will analyze the efficiency of our models. We define efficiency as the ability to achieve more with less, which in our case means asking "How good is a model at gaining rewards without using an excessive number of steps?". We assess this with two metrics:
Step Count. This metric is self-explanatory: we track the number of steps each model takes. A higher count indicates a model that on average takes longer to finish, and a lower count the opposite.
Efficiency Score. We calculate this metric by performing:
$$\sum_{i=1}^{k}\frac{\text{reward}(i;\pi)}{\text{step}(i;\pi)}$$
where $i$ indexes the $k$ testing episodes, $\pi$ is the policy, and $\text{reward}(i;\pi)$ and $\text{step}(i;\pi)$ give the total reward and step count for the $i$'th testing episode under policy $\pi$.
Essentially, the higher the efficiency score, the more efficient the model, and the lower the score, the less efficient. Generally, we want a model with a high efficiency score and a low step count.
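As a toy illustration of the formula above (the per-episode reward and step totals here are made up, not our actual results):

```python
# Hypothetical totals for k = 3 testing episodes of one policy
rewards = [-120.0, -90.0, -150.0]  # total reward per episode, reward(i)
steps = [200, 200, 200]            # step count per episode, step(i)

# Efficiency score: sum over episodes of reward(i) / step(i)
efficiency = sum(r / s for r, s in zip(rewards, steps))
print(round(efficiency, 2))  # -1.8
```

Because Pendulum-v1 rewards are negative, scores closer to zero (like SAC's below) indicate more reward gained per step.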
all_avg_reward_per_step = {
    'DQN': DQN_metrics.avg_reward_per_step(),
    'Improved DQN': ImprovedDQN_metrics.avg_reward_per_step(),
    'DDQN': DDQN_metrics.avg_reward_per_step(),
    'SAC': SAC_metrics.avg_reward_per_step()
}
# Convert the dictionary to a DataFrame
df = create_dataframe_from_dict(all_avg_reward_per_step, 'avg_reward_per_step')
df
| | avg_reward_per_step |
|---|---|
| DQN | -1.819133 |
| Improved DQN | -2.990152 |
| DDQN | -3.121399 |
| SAC | -0.880894 |
Not surprisingly, SAC had the highest efficiency score: it understood the environment so quickly that it achieved incredible results before the 100-episode mark. It mastered the task with far less training experience, the embodiment of achieving more with less.
# Sort the DataFrame by 'avg_reward_per_step' in ascending order
df = df.sort_values(by='avg_reward_per_step', ascending=True)
fig = plt.figure(figsize=(7, 4))
fig.suptitle("Efficiency Scores")
ax = fig.subplots()
sns.barplot(
    data=df,
    y='avg_reward_per_step',
    x=df.index,  # Swap x and y axes
    ax=ax,
    palette=sns.color_palette('Set2')
)
ax.set_ylabel('Avg reward per step')
ax.set_xlabel('Model')
plt.show()
Next, we will modify the Soft Actor-Critic model and see if we are able to further improve the performance of our best performing model.
MODIFYING THE SOFT ACTOR-CRITIC MODEL
We expose the hyperparameters as arguments to the __init__() function of the class:
class SACAgentTuning:
    def __init__(
        self,
        state_dim=3,
        action_dim=1,
        lr_pi=0.001,
        lr_q=0.001,
        gamma=0.98,
        batch_size=200,
        buffer_limit=100000,
        tau=0.005,
        init_alpha=0.01,
        lr_alpha=0.005,
    ):
        self.state_dim = state_dim
        self.action_dim = action_dim
        self.lr_pi = lr_pi
        self.lr_q = lr_q
        self.gamma = gamma
        self.batch_size = batch_size
        self.buffer_limit = buffer_limit
        self.tau = tau
        self.init_alpha = init_alpha
        self.target_entropy = -self.action_dim
        self.lr_alpha = lr_alpha
        self.memory = ReplayBuffer(self.buffer_limit)
        self.log_alpha = torch.tensor(np.log(self.init_alpha))
        self.log_alpha.requires_grad = True
        self.log_alpha_optimizer = optim.Adam([self.log_alpha], lr=self.lr_alpha)
        self.PI = PolicyNetwork(self.state_dim, self.action_dim, self.lr_pi)
        self.Q1 = QNetwork(self.state_dim, self.action_dim, self.lr_q)
        self.Q1_target = QNetwork(self.state_dim, self.action_dim, self.lr_q)
        self.Q2 = QNetwork(self.state_dim, self.action_dim, self.lr_q)
        self.Q2_target = QNetwork(self.state_dim, self.action_dim, self.lr_q)
        self.Q1_target.load_state_dict(self.Q1.state_dict())
        self.Q2_target.load_state_dict(self.Q2.state_dict())

    def choose_action(self, s):
        with torch.no_grad():
            action, log_prob = self.PI.sample(s)
        return action, log_prob

    def calc_target(self, mini_batch):
        s, a, r, s_prime, done = mini_batch
        with torch.no_grad():
            a_prime, log_prob_prime = self.PI.sample(s_prime)
            entropy = -self.log_alpha.exp() * log_prob_prime
            q1_target, q2_target = self.Q1_target(s_prime, a_prime), self.Q2_target(s_prime, a_prime)
            q_target = torch.min(q1_target, q2_target)
            target = r + self.gamma * done * (q_target + entropy)
        return target

    def train_agent(self):
        mini_batch = self.memory.sample(self.batch_size)
        s_batch, a_batch, r_batch, s_prime_batch, done_batch = mini_batch
        td_target = self.calc_target(mini_batch)
        # Training of Q1
        q1_loss = F.smooth_l1_loss(self.Q1(s_batch, a_batch), td_target)
        self.Q1.optimizer.zero_grad()
        q1_loss.mean().backward()
        self.Q1.optimizer.step()
        # Training of Q2
        q2_loss = F.smooth_l1_loss(self.Q2(s_batch, a_batch), td_target)
        self.Q2.optimizer.zero_grad()
        q2_loss.mean().backward()
        self.Q2.optimizer.step()
        # Training of PI
        a, log_prob = self.PI.sample(s_batch)
        entropy = -self.log_alpha.exp() * log_prob
        q1, q2 = self.Q1(s_batch, a), self.Q2(s_batch, a)
        q = torch.min(q1, q2)
        pi_loss = -(q + entropy)  # For gradient ascent
        self.PI.optimizer.zero_grad()
        pi_loss.mean().backward()
        self.PI.optimizer.step()
        # Alpha train
        self.log_alpha_optimizer.zero_grad()
        alpha_loss = -(self.log_alpha.exp() * (log_prob + self.target_entropy).detach()).mean()
        alpha_loss.backward()
        self.log_alpha_optimizer.step()
        # Soft update of Q1 and Q2
        for param_target, param in zip(self.Q1_target.parameters(), self.Q1.parameters()):
            param_target.data.copy_(param_target.data * (1.0 - self.tau) + param.data * self.tau)
        for param_target, param in zip(self.Q2_target.parameters(), self.Q2.parameters()):
            param_target.data.copy_(param_target.data * (1.0 - self.tau) + param.data * self.tau)
HYPERPARAMETER TUNING FUNCTION
def hp_tune_SACAgent(config):
    # Initialize the SAC hp_agent and related variables required
    hp_agent = SACAgentTuning(**config)
    env = gym.make("Pendulum-v1", g=9.81)
    episodes = 800
    total_rewards = []
    no_of_steps = []
    success_count = 0
    best_reward = float('-inf')
    # Get hypertuning checkpoint
    if train.get_checkpoint():
        loaded_checkpoint = train.get_checkpoint()
        with loaded_checkpoint.as_directory() as loaded_checkpoint_dir:
            model_state = torch.load(
                os.path.join(loaded_checkpoint_dir, "checkpoint.pt")
            )
            # The checkpoint stores the policy network's weights, so restore them there
            hp_agent.PI.load_state_dict(model_state)
    # Loop through the range of episodes
    for episode in range(episodes):
        state = env.reset()
        score, done = 0.0, False
        counter = 0
        while not done:
            counter += 1
            action, log_prob = hp_agent.choose_action(torch.FloatTensor(state))
            state_prime, reward, done, _ = env.step([action])
            hp_agent.memory.put((state, action, reward, state_prime, done))
            score += reward
            state = state_prime
            if hp_agent.memory.size() > 1000:
                hp_agent.train_agent()
        # Recording results
        if len(total_rewards) > 0:
            success_count += (score - total_rewards[-1]) >= 200
        total_rewards.append(score)
        no_of_steps.append(counter)
        if score > best_reward:
            best_reward = score
        # Saving Checkpoint
        metrics = {
            "avg_reward": np.mean(total_rewards),
        }
        with tempfile.TemporaryDirectory() as tempdir:
            torch.save(
                hp_agent.PI.state_dict(),
                os.path.join(tempdir, "checkpoint.pt"),
            )
            train.report(metrics=metrics, checkpoint=Checkpoint.from_directory(tempdir))
    env.close()
RUNNING HYPERPARAMETER TUNING
We use ASHAScheduler, which is an alias for AsyncHyperBandScheduler. It is a scheduler for hyperparameter optimization in distributed machine learning and neural architecture search (NAS): it manages multiple trials with different hyperparameter configurations, applies early stopping, and is designed for parallel, asynchronous execution, making it useful for finding optimal hyperparameters while utilizing multiple computing resources.
search_space = {
    "state_dim": 3,  # Fixed for the environment
    "action_dim": 1,  # Fixed for the environment
    "lr_pi": tune.loguniform(1e-4, 0.1),  # Log-uniform search for lr_pi
    "lr_q": tune.loguniform(1e-4, 0.1),  # Log-uniform search for lr_q
    "gamma": tune.choice([0.95, 0.98, 0.99]),  # Choices for gamma
    "batch_size": tune.choice([100, 200, 300]),  # Choices for batch_size
    "buffer_limit": tune.choice([50000, 100000, 200000]),  # Choices for buffer_limit
    "tau": tune.uniform(0.001, 0.01),  # Uniform search for tau
    "init_alpha": tune.loguniform(1e-4, 0.1),  # Log-uniform search for init_alpha
    "lr_alpha": tune.loguniform(1e-4, 0.1),  # Log-uniform search for lr_alpha
}
scheduler = ASHAScheduler(
    max_t=800,
    grace_period=1,
    reduction_factor=2
)
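The early-stopping idea ASHA builds on can be sketched as plain successive halving: at each rung, only the top fraction of trials (determined by the reduction factor) survives to train further. This is a simplified synchronous sketch with hypothetical random trial scores, not Ray's actual asynchronous implementation.

```python
import random

random.seed(0)

# Eight hypothetical trials with random initial scores
trials = {f"trial_{i}": random.random() for i in range(8)}

reduction_factor = 2
rung = 0
while len(trials) > 1:
    # Keep only the top 1/reduction_factor trials at each rung;
    # the rest are stopped early, freeing compute for the survivors.
    keep = max(1, len(trials) // reduction_factor)
    survivors = sorted(trials, key=trials.get, reverse=True)[:keep]
    # "Train" the survivors further (here: just accumulate more score)
    trials = {t: trials[t] + random.random() for t in survivors}
    rung += 1

print(len(trials))  # 1 trial survives to the final rung
print(rung)         # 3 rungs: 8 -> 4 -> 2 -> 1
```

With reduction_factor=2, half the trials are cut at every rung, so most of the budget is spent on the configurations that look most promising early on.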
tuner = tune.Tuner(
    tune.with_resources(
        tune.with_parameters(hp_tune_SACAgent),
        resources={"cpu": 2}
    ),
    tune_config=tune.TuneConfig(
        metric="avg_reward",
        mode="max",
        scheduler=scheduler,
        num_samples=10,
    ),
    param_space=search_space,
)
results = tuner.fit()
best_trial = results.get_best_result("avg_reward", "max")
print(f"Best trial config: {best_trial.config}")
print(f"Best trial final average reward: {best_trial.metrics['avg_reward']}")
| Current time: | 2024-01-28 08:56:43 |
|---|---|
| Running for: | 00:00:00.19 |
| Memory: | 12.7/15.2 GiB |
| Trial name | status | loc | batch_size | buffer_limit | gamma | init_alpha | lr_alpha | lr_pi | lr_q | tau |
|---|---|---|---|---|---|---|---|---|---|---|
| hp_tune_SACAgent_1a2e2_00000 | PENDING | | 200 | 100000 | 0.98 | 0.0245411 | 0.021885 | 0.000551092 | 0.0178988 | 0.00284066 |
| hp_tune_SACAgent_1a2e2_00001 | PENDING | | 300 | 50000 | 0.95 | 0.0091431 | 0.0136917 | 0.0040095 | 0.00476492 | 0.00873231 |
| hp_tune_SACAgent_1a2e2_00002 | PENDING | | 200 | 200000 | 0.99 | 0.0347088 | 0.0153973 | 0.000388744 | 0.00658863 | 0.00809449 |
| hp_tune_SACAgent_1a2e2_00003 | PENDING | | 100 | 50000 | 0.95 | 0.000760478 | 0.000133917 | 0.0325391 | 0.00393464 | 0.00478841 |
| hp_tune_SACAgent_1a2e2_00004 | PENDING | | 200 | 50000 | 0.99 | 0.00702418 | 0.000293844 | 0.0819245 | 0.000352949 | 0.0087974 |
| hp_tune_SACAgent_1a2e2_00005 | PENDING | | 100 | 200000 | 0.95 | 0.00360496 | 0.035589 | 0.00308427 | 0.0183588 | 0.00899611 |
| hp_tune_SACAgent_1a2e2_00006 | PENDING | | 300 | 50000 | 0.99 | 0.00146163 | 0.00187331 | 0.000450539 | 0.00013053 | 0.00887594 |
| hp_tune_SACAgent_1a2e2_00007 | PENDING | | 100 | 50000 | 0.99 | 0.000231165 | 0.000205124 | 0.000133754 | 0.0013325 | 0.00743987 |
| hp_tune_SACAgent_1a2e2_00008 | PENDING | | 300 | 200000 | 0.98 | 0.0576165 | 0.00344381 | 0.0255 | 0.00927923 | 0.00424883 |
| hp_tune_SACAgent_1a2e2_00009 | PENDING | | 300 | 100000 | 0.95 | 0.00665678 | 0.00979623 | 0.054478 | 0.0014454 | 0.00730881 |
(hp_tune_SACAgent pid=22804) c:\Users\zzhen\anaconda3\envs\gpu_env\lib\site-packages\gym\core.py:317: DeprecationWarning: WARN: Initializing wrapper in old step API which returns one bool instead of two. It is recommended to set `new_step_api=True` to use new step API. This will be the default behaviour in future.
(hp_tune_SACAgent pid=22804) deprecation(
(hp_tune_SACAgent pid=22804) c:\Users\zzhen\anaconda3\envs\gpu_env\lib\site-packages\gym\wrappers\step_api_compatibility.py:39: DeprecationWarning: WARN: Initializing environment in old step API which returns one bool instead of two. It is recommended to set `new_step_api=True` to use new step API. This will be the default behaviour in future.
(hp_tune_SACAgent pid=22804) deprecation(
(hp_tune_SACAgent pid=22804) c:\Users\zzhen\anaconda3\envs\gpu_env\lib\site-packages\numpy\core\fromnumeric.py:43: FutureWarning: The input object of type 'Tensor' is an array-like implementing one of the corresponding protocols (`__array__`, `__array_interface__` or `__array_struct__`); but not a sequence (or 0-D). In the future, this object will be coerced as if it was first converted using `np.array(obj)`. To retain the old behaviour, you have to either modify the type 'Tensor', or assign to an empty array created with `np.empty(correct_shape, dtype=object)`.
(hp_tune_SACAgent pid=22804) result = getattr(asarray(obj), method)(*args, **kwds)
(hp_tune_SACAgent pid=22804) c:\Users\zzhen\anaconda3\envs\gpu_env\lib\site-packages\gym\utils\passive_env_checker.py:241: DeprecationWarning: `np.bool8` is a deprecated alias for `np.bool_`. (Deprecated NumPy 1.24)
(hp_tune_SACAgent pid=22804) if not isinstance(terminated, (bool, np.bool8)):
(hp_tune_SACAgent pid=22804) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00000_0_batch_size=200,buffer_limit=100000,gamma=0.9800,init_alpha=0.0245,lr_alpha=0.0219,lr_pi=0.0006,lr_q_2024-01-28_08-56-42/checkpoint_000000)
(hp_tune_SACAgent pid=12028) C:\Users\zzhen\AppData\Local\Temp\ipykernel_34388\552776624.py:21: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\utils\tensor_new.cpp:264.)
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000199) [repeated 3x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000203) [repeated 4x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000207) [repeated 4x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000211) [repeated 4x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000215) [repeated 4x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000219) [repeated 4x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000223) [repeated 4x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000227) [repeated 4x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000231) [repeated 4x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000235) [repeated 4x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000239) [repeated 4x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000243) [repeated 4x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000247) [repeated 4x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000251) [repeated 4x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000255) [repeated 4x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000259) [repeated 4x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000263) [repeated 4x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000267) [repeated 4x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000271) [repeated 4x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000275) [repeated 4x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000279) [repeated 4x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000283) [repeated 4x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000287) [repeated 4x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000291) [repeated 4x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000295) [repeated 4x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000299) [repeated 4x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000303) [repeated 4x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000307) [repeated 4x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000310) [repeated 3x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000313) [repeated 3x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000316) [repeated 3x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000320) [repeated 4x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000324) [repeated 4x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000328) [repeated 4x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000332) [repeated 4x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000336) [repeated 4x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000340) [repeated 4x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000344) [repeated 4x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000348) [repeated 4x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000351) [repeated 3x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000355) [repeated 4x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000358) [repeated 3x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000360) [repeated 2x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000363) [repeated 3x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000366) [repeated 3x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000369) [repeated 3x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000372) [repeated 3x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000375) [repeated 3x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000378) [repeated 3x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000382) [repeated 4x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000386) [repeated 4x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000390) [repeated 4x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000394) [repeated 4x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000398) [repeated 4x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000402) [repeated 4x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000406) [repeated 4x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000410) [repeated 4x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000414) [repeated 4x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000418) [repeated 4x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000421) [repeated 3x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000425) [repeated 4x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000429) [repeated 4x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000433) [repeated 4x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000437) [repeated 4x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000441) [repeated 4x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000445) [repeated 4x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000448) [repeated 3x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000451) [repeated 3x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000455) [repeated 4x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000459) [repeated 4x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000463) [repeated 4x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000467) [repeated 4x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000471) [repeated 4x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000475) [repeated 4x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000478) [repeated 3x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000482) [repeated 4x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000486) [repeated 4x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000490) [repeated 4x across cluster]
(hp_tune_SACAgent pid=12028) Checkpoint successfully created at: Checkpoint(filesystem=local, path=C:/Users/zzhen/ray_results/hp_tune_SACAgent_2024-01-28_08-56-42/hp_tune_SACAgent_1a2e2_00002_2_batch_size=200,buffer_limit=200000,gamma=0.9900,init_alpha=0.0347,lr_alpha=0.0154,lr_pi=0.0004,lr_q_2024-01-28_08-56-42/checkpoint_000796) [repeated 4x across cluster]
2024-01-28 09:17:19,506 INFO tune.py:1042 -- Total run time: 1236.72 seconds (1236.64 seconds for the tuning loop).
Best trial config: {'state_dim': 3, 'action_dim': 1, 'lr_pi': 0.0003887437422389239, 'lr_q': 0.006588627430399412, 'gamma': 0.99, 'batch_size': 200, 'buffer_limit': 200000, 'tau': 0.008094487127446998, 'init_alpha': 0.03470881719479883, 'lr_alpha': 0.015397298925206759}
Best trial final total reward: -169.30299748599347
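For context, the "best trial" above is simply the trial whose final total reward is highest. That selection can be sketched in plain Python (the trial dicts and field names below are illustrative stand-ins, not the actual Ray Tune result objects):

```python
# Illustrative trial records: each pairs a hyperparameter config
# with the final total reward the trial achieved.
trials = [
    {"config": {"lr_pi": 3e-4, "gamma": 0.99}, "final_reward": -169.3},
    {"config": {"lr_pi": 1e-3, "gamma": 0.95}, "final_reward": -412.7},
    {"config": {"lr_pi": 5e-5, "gamma": 0.99}, "final_reward": -980.1},
]

# The best trial maximizes the final total reward (rewards in
# Pendulum-v1 are negative, so "best" means closest to zero).
best_trial = max(trials, key=lambda t: t["final_reward"])
print(best_trial["config"])
```

In the notebook this role is played by the object returned from the Ray Tune run, whose winning config is then unpacked into `SACAgentTuning(**best_trial.config)`.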
BEST MODEL AFTER TUNING
best_hp_agent = SACAgentTuning(**best_trial.config)
EVALUATING HYPERTUNED MODEL
def train_best_SACAgent(best_hp_agent: SACAgentTuning):
    # Initialize the SAC agent and related variables
    agent = best_hp_agent
    env = gym.make('Pendulum-v1', g=9.81)
    episodes = 800
    total_rewards = []
    no_of_steps = []
    success_count = 0
    frames = []
    best_episode = 0
    best_reward = float('-inf')
    # Loop through the range of episodes
    for episode in range(episodes):
        state = env.reset()
        score, done = 0.0, False
        start_time = datetime.datetime.now()
        counter = 0
        while not done:
            counter += 1
            action, log_prob = agent.choose_action(torch.FloatTensor(state))
            state_prime, reward, done, _ = env.step([action])
            agent.memory.put((state, action, reward, state_prime, done))
            score += reward
            state = state_prime
            if counter % 50 == 0 and score > -50:
                screen = env.render(mode='rgb_array')
                frames.append(screen)
            # Only start gradient updates once the replay buffer is warmed up
            if agent.memory.size() > 1000:
                agent.train_agent()
        # Recording results
        if len(total_rewards) > 0:
            success_count += (score - total_rewards[-1]) >= 200 or score > -2
        total_rewards.append(score)
        no_of_steps.append(counter)
        if score > best_reward:
            best_reward = score
            best_episode = episode
        # Saving the models
        save_folder = "Tuned_SAC"
        if not os.path.exists(save_folder):
            os.makedirs(save_folder)
        if episode == best_episode:
            model_name = os.path.join(save_folder, "Tuned_SAC" + str(episode) + ".pt")
            torch.save(agent.PI.state_dict(), model_name)
        if episode % 10 == 0:
            elapsed_time = datetime.datetime.now() - start_time
            print('Episode {:>4} | Total Reward: {:>8.2f} | Elapsed: {}'.format(episode, score, elapsed_time))
    env.close()
    return {
        'total_rewards': total_rewards,
        'no_of_steps': no_of_steps,
        'success_count': success_count,
        'frames': frames
    }
tuned_SAC_results = train_best_SACAgent(best_hp_agent)
Episode 0 | Total Reward: -887.64 | Elapsed: 0:00:00.055890
Episode 10 | Total Reward: -665.24 | Elapsed: 0:00:02.148514
Episode 20 | Total Reward: -131.11 | Elapsed: 0:00:01.387094
Episode 30 | Total Reward: -370.03 | Elapsed: 0:00:01.326462
Episode 40 | Total Reward: -234.36 | Elapsed: 0:00:01.342169
Episode 50 | Total Reward: -120.40 | Elapsed: 0:00:01.368868
Episode 60 | Total Reward: -1.85 | Elapsed: 0:00:01.395959
Episode 70 | Total Reward: -122.71 | Elapsed: 0:00:01.352037
Episode 80 | Total Reward: -126.02 | Elapsed: 0:00:01.462505
Episode 90 | Total Reward: -121.53 | Elapsed: 0:00:01.382721
Episode 100 | Total Reward: -237.28 | Elapsed: 0:00:01.378617
Episode 110 | Total Reward: -356.86 | Elapsed: 0:00:01.403869
Episode 120 | Total Reward: -126.44 | Elapsed: 0:00:01.356247
Episode 130 | Total Reward: -130.07 | Elapsed: 0:00:01.407759
Episode 140 | Total Reward: -245.72 | Elapsed: 0:00:01.416777
Episode 150 | Total Reward: -229.19 | Elapsed: 0:00:01.447425
Episode 160 | Total Reward: -117.94 | Elapsed: 0:00:01.322713
Episode 170 | Total Reward: -328.76 | Elapsed: 0:00:01.434287
Episode 180 | Total Reward: -248.94 | Elapsed: 0:00:01.437368
Episode 190 | Total Reward: -232.39 | Elapsed: 0:00:01.411170
Episode 200 | Total Reward: -224.91 | Elapsed: 0:00:01.512126
Episode 210 | Total Reward: -236.58 | Elapsed: 0:00:01.390656
Episode 220 | Total Reward: -127.59 | Elapsed: 0:00:01.421051
Episode 230 | Total Reward: -1.18 | Elapsed: 0:00:01.456356
Episode 240 | Total Reward: -1.49 | Elapsed: 0:00:01.440719
Episode 250 | Total Reward: -122.12 | Elapsed: 0:00:01.414578
Episode 260 | Total Reward: -125.92 | Elapsed: 0:00:01.506192
Episode 270 | Total Reward: -129.42 | Elapsed: 0:00:01.422304
Episode 280 | Total Reward: -11.53 | Elapsed: 0:00:01.502813
Episode 290 | Total Reward: -135.59 | Elapsed: 0:00:01.443406
Episode 300 | Total Reward: -121.21 | Elapsed: 0:00:01.478148
Episode 310 | Total Reward: -122.59 | Elapsed: 0:00:01.431704
Episode 320 | Total Reward: -231.50 | Elapsed: 0:00:01.484078
Episode 330 | Total Reward: -119.16 | Elapsed: 0:00:01.394267
Episode 340 | Total Reward: -125.35 | Elapsed: 0:00:01.460871
Episode 350 | Total Reward: -124.25 | Elapsed: 0:00:01.465501
Episode 360 | Total Reward: -117.02 | Elapsed: 0:00:01.420217
Episode 370 | Total Reward: -232.21 | Elapsed: 0:00:01.431042
Episode 380 | Total Reward: -237.55 | Elapsed: 0:00:01.431311
Episode 390 | Total Reward: -344.62 | Elapsed: 0:00:01.429877
Episode 400 | Total Reward: -129.17 | Elapsed: 0:00:01.467225
Episode 410 | Total Reward: -235.88 | Elapsed: 0:00:01.484058
Episode 420 | Total Reward: -128.39 | Elapsed: 0:00:01.464355
Episode 430 | Total Reward: -114.32 | Elapsed: 0:00:01.441448
Episode 440 | Total Reward: -117.72 | Elapsed: 0:00:01.445694
Episode 450 | Total Reward: -227.68 | Elapsed: 0:00:01.454484
Episode 460 | Total Reward: -125.78 | Elapsed: 0:00:01.412873
Episode 470 | Total Reward: -246.13 | Elapsed: 0:00:01.456214
Episode 480 | Total Reward: -121.11 | Elapsed: 0:00:01.464099
Episode 490 | Total Reward: -122.82 | Elapsed: 0:00:01.452174
Episode 500 | Total Reward: -246.19 | Elapsed: 0:00:01.527423
Episode 510 | Total Reward: -122.98 | Elapsed: 0:00:01.512263
Episode 520 | Total Reward: -225.45 | Elapsed: 0:00:01.514035
Episode 530 | Total Reward: -119.67 | Elapsed: 0:00:01.442444
Episode 540 | Total Reward: -127.23 | Elapsed: 0:00:01.546772
Episode 550 | Total Reward: -119.20 | Elapsed: 0:00:01.459459
Episode 560 | Total Reward: -126.79 | Elapsed: 0:00:01.508984
Episode 570 | Total Reward: -224.91 | Elapsed: 0:00:01.478620
Episode 580 | Total Reward: -123.65 | Elapsed: 0:00:01.483825
Episode 590 | Total Reward: -123.34 | Elapsed: 0:00:01.468968
Episode 600 | Total Reward: -128.07 | Elapsed: 0:00:01.514391
Episode 610 | Total Reward: -0.92 | Elapsed: 0:00:01.528736
Episode 620 | Total Reward: -121.43 | Elapsed: 0:00:01.467581
Episode 630 | Total Reward: -120.26 | Elapsed: 0:00:01.576493
Episode 640 | Total Reward: -239.47 | Elapsed: 0:00:01.495190
Episode 650 | Total Reward: -340.64 | Elapsed: 0:00:01.584836
Episode 660 | Total Reward: -127.64 | Elapsed: 0:00:01.542311
Episode 670 | Total Reward: -238.21 | Elapsed: 0:00:01.695179
Episode 680 | Total Reward: -223.48 | Elapsed: 0:00:01.623173
Episode 690 | Total Reward: -1.11 | Elapsed: 0:00:01.575105
Episode 700 | Total Reward: -4.74 | Elapsed: 0:00:01.554986
Episode 710 | Total Reward: -128.50 | Elapsed: 0:00:01.606051
Episode 720 | Total Reward: -120.96 | Elapsed: 0:00:01.527371
Episode 730 | Total Reward: -126.23 | Elapsed: 0:00:01.557759
Episode 740 | Total Reward: -127.12 | Elapsed: 0:00:01.543473
Episode 750 | Total Reward: -341.28 | Elapsed: 0:00:01.555745
Episode 760 | Total Reward: -127.68 | Elapsed: 0:00:01.679928
Episode 770 | Total Reward: -225.67 | Elapsed: 0:00:01.607428
Episode 780 | Total Reward: -120.50 | Elapsed: 0:00:01.576473
Episode 790 | Total Reward: -117.61 | Elapsed: 0:00:01.671018
# Calculating statistical measures
average_reward = np.mean(tuned_SAC_results['total_rewards'])
median_reward = np.median(tuned_SAC_results['total_rewards'])
max_reward = np.max(tuned_SAC_results['total_rewards'])
min_reward = np.min(tuned_SAC_results['total_rewards'])
# Identifying the best episode
best_episode_index = np.argmax(tuned_SAC_results['total_rewards'])
# Printing the Statistics
print("Performance Statistics for the SAC Model:")
print("--------------------------------------------")
print(f"Best Episode : {best_episode_index}")
print(f"Average Reward : {average_reward:.2f}")
print(f"Median Reward : {median_reward:.2f}")
print(f"Maximum Reward : {max_reward:.2f}")
print(f"Minimum Reward : {min_reward:.2f}")
# Plot the charts to show performance over time
plot_agent_performance(tuned_SAC_results['total_rewards'], average_reward, model_name="Tuned SAC")
Performance Statistics for the SAC Model:
--------------------------------------------
Best Episode : 789
Average Reward : -168.98
Median Reward : -125.16
Maximum Reward : -0.24
Minimum Reward : -1604.26
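`plot_agent_performance` is defined earlier in the notebook; its core idea, smoothing the noisy per-episode rewards with a moving average before plotting them against the overall mean, can be sketched roughly as follows (the window size here is an illustrative assumption, not the notebook's actual value):

```python
import numpy as np

def moving_average(rewards, window=50):
    """Smooth per-episode rewards with a simple sliding-window mean."""
    rewards = np.asarray(rewards, dtype=float)
    if len(rewards) < window:
        window = len(rewards)
    kernel = np.ones(window) / window
    # 'valid' keeps only positions where the window fully overlaps the data
    return np.convolve(rewards, kernel, mode='valid')

# Example: a noisy but improving reward curve
rewards = [-300, -280, -310, -200, -150, -140, -90, -60, -20, -5]
smoothed = moving_average(rewards, window=3)
print(smoothed)  # each entry is the mean of 3 consecutive episodes
```

Smoothing like this is why the plotted curves are easier to read than the raw episode log above, where returns jump between roughly -1 and -340 from one printout to the next.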

TESTING OUR MODEL WEIGHTS
config = {
    "state_dim": 3,
    "action_dim": 1,
    "lr_pi": 0.0003887437422389239,
    "lr_q": 0.006588627430399412,
    "gamma": 0.99,
    "batch_size": 200,
    "buffer_limit": 200000,
    "tau": 0.008094487127446998,
    "init_alpha": 0.03470881719479883,
    "lr_alpha": 0.015397298925206759,
}
agent = SACAgentTuning(**config)
agent.PI.load_state_dict(torch.load('./Tuned_SAC/Tuned_SAC789.pt'))
test_agent(agent, 'SAC')
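`test_agent` is defined earlier in the notebook. Its general shape, a greedy evaluation loop that runs the loaded policy without exploration and accumulates reward, can be illustrated with a self-contained stub (the stub environment and the `act` interface below are hypothetical stand-ins, not the notebook's actual `test_agent` or the Pendulum environment):

```python
class StubEnv:
    """Tiny deterministic stand-in for a gym-style environment."""
    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return 0.0  # initial observation

    def step(self, action):
        self.t += 1
        reward = -abs(action - 1.0)   # best possible action is 1.0
        done = self.t >= self.horizon
        return float(self.t), reward, done

class GreedyAgent:
    """Always proposes the same action; a real agent would query PI(state)."""
    def act(self, state):
        return 1.0

def evaluate(agent, env, n_episodes=3):
    """Run the policy deterministically and return per-episode returns."""
    returns = []
    for _ in range(n_episodes):
        state, done, total = env.reset(), False, 0.0
        while not done:
            action = agent.act(state)  # no exploration noise at test time
            state, reward, done = env.step(action)
            total += reward
        returns.append(total)
    return returns

print(evaluate(GreedyAgent(), StubEnv()))  # → [0.0, 0.0, 0.0] (optimal action)
```

The key point the sketch captures is that evaluation uses the loaded weights deterministically, so the returns reflect the learned policy rather than exploration noise.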
MODEL TRAINING EVOLUTION
# Visualizing the pendulum's animation
create_animation(tuned_SAC_results['frames'])
We will evaluate the hyperparameter-tuned SAC model against the other models using the same metrics as before, to determine whether tuning indeed improved performance.
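`MetricsCalculator` is defined earlier in the notebook; the three metrics compared below can be expressed, in a minimal hypothetical form, as follows (the success threshold of -150 is an illustrative assumption, not necessarily the class's actual cutoff):

```python
import numpy as np

def avg_reward_per_episode(total_rewards):
    """Mean episodic return over the whole training run."""
    return float(np.mean(total_rewards))

def success_rate(total_rewards, threshold=-150.0):
    """Fraction of episodes whose return clears the (assumed) threshold."""
    rewards = np.asarray(total_rewards)
    return float(np.mean(rewards > threshold))

def avg_reward_per_step(total_rewards, steps_per_episode):
    """Total reward earned divided by total environment steps taken."""
    return float(np.sum(total_rewards) / np.sum(steps_per_episode))

rewards = [-300.0, -120.0, -1.5, -240.0]
steps = [200, 200, 200, 200]
print(avg_reward_per_episode(rewards))      # -165.375
print(success_rate(rewards))                # 0.5
print(avg_reward_per_step(rewards, steps))
```

Per-step reward is the "efficiency" view: two models with the same episodic average can differ in how much reward each environment interaction yields.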
tuned_SAC_metrics = MetricsCalculator(**tuned_SAC_results, n_episodes=800)
all_avg_reward_per_episode = {
'DQN': DQN_metrics.avg_reward_per_episode(),
'Improved DQN': ImprovedDQN_metrics.avg_reward_per_episode(),
'DDQN': DDQN_metrics.avg_reward_per_episode(),
'SAC': SAC_metrics.avg_reward_per_episode(),
'Tuned SAC': tuned_SAC_metrics.avg_reward_per_episode(),
}
df = create_dataframe_from_dict(all_avg_reward_per_episode, 'Avg_Reward_Per_Episode')
df
|  | Avg_Reward_Per_Episode |
|---|---|
| DQN | -340.944229 |
| Improved DQN | -545.086096 |
| DDQN | -569.351038 |
| SAC | -176.178789 |
| Tuned SAC | -168.979288 |
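From the table above, the tuned SAC's gain over the untuned SAC can be quantified directly (returns are negative, so a smaller magnitude is better):

```python
sac = -176.178789        # average reward per episode, untuned SAC
tuned_sac = -168.979288  # average reward per episode, tuned SAC

# Relative improvement: how much closer to zero the tuned average is
improvement_pct = (tuned_sac - sac) / abs(sac) * 100
print(f"Tuned SAC improves on SAC by {improvement_pct:.2f}%")
# → Tuned SAC improves on SAC by 4.09%
```

A roughly 4% gain in average reward is modest, which is consistent with the untuned SAC already being the strongest baseline in the table.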
# Sort the DataFrame by 'Avg_Reward_Per_Episode' in ascending order
df = df.sort_values(by='Avg_Reward_Per_Episode', ascending=True)
fig = plt.figure(figsize=(7, 4))
fig.suptitle(f"Average Reward")
ax = fig.subplots()
sns.barplot(
data=df,
y='Avg_Reward_Per_Episode',
x=df.index,
ax=ax,
palette=sns.color_palette('Set2')
)
# ax.legend()
ax.set_ylabel('Avg Reward Per Episode')
ax.set_xlabel('Model')
plt.show()
all_success_rate = {
'DQN': DQN_metrics.success_rate(),
'Improved DQN': ImprovedDQN_metrics.success_rate(),
'DDQN': DDQN_metrics.success_rate(),
'SAC': SAC_metrics.success_rate(),
'Tuned SAC': tuned_SAC_metrics.success_rate(),
}
# Convert the dictionary to a DataFrame
df = create_dataframe_from_dict(all_success_rate, 'success_rate')
df
|  | success_rate |
|---|---|
| DQN | 0.2450 |
| Improved DQN | 0.1525 |
| DDQN | 0.1575 |
| SAC | 0.0725 |
| Tuned SAC | 0.1100 |
# Sort the DataFrame by 'success_rate' in ascending order
df = df.sort_values(by='success_rate', ascending=True)
fig = plt.figure(figsize=(7, 4))
fig.suptitle("Success Rate")
ax = fig.subplots()
sns.barplot(
data=df,
y='success_rate',
x=df.index, # Swap x and y axes
ax=ax,
palette=sns.color_palette('Set2')
)
ax.set_ylabel('Success rate')
ax.set_xlabel('Model')
plt.show()
all_avg_reward_per_step = {
'DQN': DQN_metrics.avg_reward_per_step(),
'Improved DQN': ImprovedDQN_metrics.avg_reward_per_step(),
'DDQN': DDQN_metrics.avg_reward_per_step(),
'SAC': SAC_metrics.avg_reward_per_step(),
'Tuned SAC': tuned_SAC_metrics.avg_reward_per_step(),
}
# Convert the dictionary to a DataFrame
df = create_dataframe_from_dict(all_avg_reward_per_step, 'avg_reward_per_step')
# Sort the DataFrame by 'avg_reward_per_step' in ascending order
df = df.sort_values(by='avg_reward_per_step', ascending=True)
fig = plt.figure(figsize=(7, 4))
fig.suptitle(f"Efficiency Scores")
ax = fig.subplots()
sns.barplot(
data=df,
y='avg_reward_per_step',
x=df.index, # Swap x and y axes
ax=ax,
palette=sns.color_palette('Set2')
)
# ax.legend()
ax.set_ylabel('Avg reward per step')
ax.set_xlabel('Model')
plt.show()
Reinforcement learning is a powerful and promising branch of artificial intelligence. With its ability to learn through trial and error and to make decisions in dynamic environments, it has been successfully applied to problems ranging from gaming to robotics. The Pendulum environment is a classic example of how reinforcement learning can be used to solve control problems in a simulated environment.
We have successfully tackled the Pendulum problem through the use of Reinforcement Learning algorithms, namely DQN, DDQN and SAC. Through this project, we have evaluated the models on various aspects such as performance, efficiency, robustness and feature importance. Our findings have provided valuable insights into the behavior of the algorithms and the intricacies of Reinforcement Learning.
This project has been a challenging yet enlightening experience and has helped us to gain a deeper understanding of Reinforcement Learning concepts. We hope that our work can contribute to the development of more advanced Reinforcement Learning models in the future.